pith. machine review for the scientific record.

arxiv: 2210.02399 · v1 · submitted 2022-10-05 · 💻 cs.CV · cs.AI

Recognition: 2 theorem links


Phenaki: Variable Length Video Generation From Open Domain Textual Description

Authors on Pith no claims yet

Pith reviewed 2026-05-17 01:37 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords video synthesis · text conditioned generation · variable length · discrete tokens · causal attention · masked transformer · joint training

The pith

Phenaki generates arbitrarily long videos from sequences of text prompts describing evolving scenes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Phenaki for synthesizing realistic videos from a sequence of textual prompts that can vary over time. It addresses computational cost, limited video data, and variable video lengths by compressing videos into discrete tokens with a causal-attention tokenizer. A bidirectional masked transformer then generates these tokens conditioned on text, and the tokens are decoded back into video. Joint training on extensive image-text pairs alongside video-text examples lets the model create longer and more coherent videos than any found in the video datasets alone, with improved spatio-temporal consistency using fewer tokens than per-frame approaches.
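The causal-in-time attention that makes the tokenizer length-agnostic can be sketched in a few lines of numpy. This is a simplification: it works per frame rather than per patch token, uses a single head, and all names are illustrative, not the paper's.

```python
import numpy as np

def causal_time_mask(num_frames: int) -> np.ndarray:
    """(i, j) is True iff frame i may attend to frame j: itself and earlier
    frames only, so the same tokenizer handles any number of frames."""
    return np.tril(np.ones((num_frames, num_frames), dtype=bool))

def masked_attention(scores: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Softmax over raw scores with masked-out positions forced to zero weight."""
    masked = np.where(mask, scores, -np.inf)
    weights = np.exp(masked - masked.max(axis=-1, keepdims=True))
    weights = np.where(mask, weights, 0.0)  # exact zeros for future frames
    return weights / weights.sum(axis=-1, keepdims=True)

# Four frames of uniform scores: frame 0 attends only to itself,
# frame 3 spreads attention over all four frames.
mask = causal_time_mask(4)
weights = masked_attention(np.zeros((4, 4)), mask)
```

Because the mask is lower-triangular, appending frames never changes the attention computed for earlier frames, which is what allows variable-length input without retraining.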

Core claim

Phenaki demonstrates that realistic open-domain video synthesis from time-variable text prompts is possible: videos are first tokenized into a compact discrete representation using causal temporal attention, a masked bidirectional transformer then predicts those tokens from text embeddings, and joint training on image and video data lets the entire system generalize to arbitrary lengths.

What carries the argument

The causal-attention video tokenizer that encodes variable-length videos as a small number of discrete tokens, combined with the text-conditioned bidirectional masked transformer that generates the token sequence.
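The bidirectional masked transformer is decoded in the MaskGIT style: start from all-masked tokens, predict every position in parallel, and commit only the most confident predictions each step. A minimal sketch, where `predict` stands in for the text-conditioned transformer and the linear unmasking schedule and confidence rule are simplified assumptions, not the paper's exact procedure:

```python
import numpy as np

MASK = -1  # sentinel id for a masked position

def iterative_masked_decode(predict, seq_len, steps=4):
    """MaskGIT-style parallel decoding: each step commits the most confident
    still-masked predictions and re-masks the rest for the next pass."""
    tokens = np.full(seq_len, MASK)
    for step in range(steps):
        ids, conf = predict(tokens)
        still_masked = tokens == MASK
        # Linear schedule: after step s, seq_len*(s+1)/steps positions are done.
        target_done = int(np.ceil(seq_len * (step + 1) / steps))
        n_new = target_done - (seq_len - still_masked.sum())
        candidates = np.where(still_masked)[0]
        order = candidates[np.argsort(-conf[candidates])]
        tokens[order[:max(n_new, 0)]] = ids[order[:max(n_new, 0)]]
    return tokens

# Toy "model": always predicts token 7, with confidence equal to position index.
def toy_predict(tokens):
    n = len(tokens)
    return np.full(n, 7), np.arange(n, dtype=float)

out = iterative_masked_decode(toy_predict, seq_len=8, steps=4)
```

The point of this scheme is that the whole token sequence is produced in a handful of parallel passes rather than one token at a time, which is what keeps generation cheap despite long videos.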

If this is right

  • Generation of videos that follow a narrative by accepting a changing sequence of text prompts over time.
  • Production of videos longer than any example in the video training corpus.
  • Enhanced consistency across space and time in the output compared to independent frame generation.
  • More efficient representation since fewer tokens are needed per video.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Such systems could eventually support creating full-length movies or educational series from detailed textual outlines.
  • Extending this to interactive settings where user prompts update mid-generation seems feasible.
  • The token compression idea may transfer to generating variable-duration content in other domains like music or text stories.

Load-bearing premise

Joint training on a large set of image-text pairs and a smaller set of video-text pairs produces a model that generalizes to generate videos of lengths and qualities beyond the video examples.
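Mechanically, joint training only requires that an image be treated as a one-frame video, so both data sources flow through the same tokenizer and transformer. A hedged sketch of the batch construction (the mixing ratio and batching details are assumptions, not stated in the text above):

```python
import numpy as np

def as_video(x: np.ndarray) -> np.ndarray:
    """Lift an image (H, W, C) to a one-frame video (1, H, W, C).

    With a causal-in-time tokenizer, a single frame is simply the shortest
    possible video, so image-text pairs train the same model."""
    return x[None] if x.ndim == 3 else x

def mixed_batch(image_text_pairs, video_text_pairs):
    """One training list mixing both sources; every example carries a time
    axis, so downstream code never branches on modality."""
    batch = [(as_video(img), cap) for img, cap in image_text_pairs]
    batch += list(video_text_pairs)
    return batch

imgs = [(np.zeros((64, 64, 3)), "a cat")]
vids = [(np.zeros((11, 64, 64, 3)), "a cat walking")]
batch = mixed_batch(imgs, vids)
```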

What would settle it

Generating a video from a long sequence of prompts for a story spanning more frames than any training video, then verifying whether the output maintains visual coherence and tracks the prompt changes throughout.
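Concretely, such a test would chain per-prompt segments, conditioning each new segment on the tail of the video generated so far. A toy sketch of that control flow (the segment length, overlap, and `generate_segment` are illustrative placeholders, not the paper's values):

```python
def generate_story(prompts, generate_segment, segment_len=11, overlap=5):
    """Chain per-prompt segments into one arbitrarily long video.

    Each new segment is conditioned on the last `overlap` frames generated so
    far, which is what carries appearance across prompt changes.
    `generate_segment(context, prompt, n_new)` stands in for the
    text-conditioned token generator plus decoder."""
    video = []
    for prompt in prompts:
        context = video[-overlap:]  # empty on the first prompt
        n_new = segment_len - len(context)
        video.extend(generate_segment(context, prompt, n_new))
    return video

# Toy generator: a "frame" is just the prompt string, repeated per new frame.
def toy_generate(context, prompt, n_new):
    return [prompt] * n_new

story = generate_story(["a cat", "the cat jumps", "it rains"], toy_generate)
```

The total length grows without bound in the number of prompts, so coherence across segment boundaries is exactly what the proposed test would measure.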

read the original abstract

We present Phenaki, a model capable of realistic video synthesis, given a sequence of textual prompts. Generating videos from text is particularly challenging due to the computational cost, limited quantities of high quality text-video data and variable length of videos. To address these issues, we introduce a new model for learning video representation which compresses the video to a small representation of discrete tokens. This tokenizer uses causal attention in time, which allows it to work with variable-length videos. To generate video tokens from text we are using a bidirectional masked transformer conditioned on pre-computed text tokens. The generated video tokens are subsequently de-tokenized to create the actual video. To address data issues, we demonstrate how joint training on a large corpus of image-text pairs as well as a smaller number of video-text examples can result in generalization beyond what is available in the video datasets. Compared to the previous video generation methods, Phenaki can generate arbitrary long videos conditioned on a sequence of prompts (i.e. time variable text or a story) in open domain. To the best of our knowledge, this is the first time a paper studies generating videos from time variable prompts. In addition, compared to the per-frame baselines, the proposed video encoder-decoder computes fewer tokens per video but results in better spatio-temporal consistency.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. Phenaki introduces a text-to-video model that generates variable-length videos from sequences of textual prompts (time-varying text or stories) in open domain. It proposes a causal video tokenizer that compresses videos to discrete tokens using causal temporal attention, paired with a bidirectional masked transformer that generates tokens conditioned on text embeddings. Joint training on large image-text corpora plus limited video-text data is claimed to enable generalization beyond video datasets alone, yielding arbitrary-length outputs with improved spatio-temporal consistency over per-frame baselines; the work positions itself as the first to study generation from time-variable prompts.

Significance. If the empirical claims are substantiated, the work would be significant for text-conditioned video synthesis by demonstrating coherent multi-prompt, story-like generation over arbitrary durations and addressing data scarcity via image-video joint training. The causal tokenizer for variable-length handling and the bidirectional masked transformer for text conditioning represent useful architectural contributions that could influence subsequent variable-length video models.

major comments (2)
  1. [Experiments / Results] Experiments / Results section: The manuscript asserts that joint training on image-text pairs plus video-text examples produces generalization for arbitrary-length, time-variable prompt sequences beyond what video datasets support, yet provides no ablation studies that isolate the image-text component's contribution to multi-prompt coherence, long-horizon consistency, or open-domain adherence. This is load-bearing for the central generalization claim and the 'first study of time-variable prompts' positioning.
  2. [Abstract and Results] Abstract and Results: No quantitative metrics (e.g., FVD, FID-video, or controlled user studies), error bars, or statistical comparisons to prior video generation methods or per-frame baselines are reported. The claimed improvements in spatio-temporal consistency and open-domain performance therefore rest solely on qualitative examples, limiting assessment of the central claims.
minor comments (2)
  1. [Method / Tokenizer] The description of the causal attention mask in the video tokenizer would benefit from an explicit equation or diagram showing how it enforces causality across variable-length sequences.
  2. [Figures] Figure captions and axis labels in the qualitative results could be expanded to specify prompt sequences, video lengths, and comparison conditions for easier reader interpretation.
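For reference, the FVD the report asks for is the Fréchet distance between Gaussians fit to embeddings of real and generated videos from a pretrained video network. A minimal numpy sketch of the distance itself, with feature extraction omitted and assuming well-conditioned covariances:

```python
import numpy as np

def frechet_distance(feats_a: np.ndarray, feats_b: np.ndarray) -> float:
    """Fréchet distance between Gaussians fit to two (N, D) feature sets.

    FVD applies this to pretrained video-network embeddings of real vs.
    generated clips; here the features are arbitrary arrays."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # tr sqrtm(cov_a @ cov_b) equals the sum of square roots of the product's
    # eigenvalues when both covariances are symmetric positive definite.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    return float(((mu_a - mu_b) ** 2).sum()
                 + np.trace(cov_a) + np.trace(cov_b) - 2.0 * tr_sqrt)
```

Identical feature sets give a distance of zero, and a constant shift of the features shows up purely through the mean term.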

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive report. We address each major comment below and have revised the manuscript to incorporate the suggested improvements where feasible.

read point-by-point responses
  1. Referee: [Experiments / Results] Experiments / Results section: The manuscript asserts that joint training on image-text pairs plus video-text examples produces generalization for arbitrary-length, time-variable prompt sequences beyond what video datasets support, yet provides no ablation studies that isolate the image-text component's contribution to multi-prompt coherence, long-horizon consistency, or open-domain adherence. This is load-bearing for the central generalization claim and the 'first study of time-variable prompts' positioning.

    Authors: We agree that dedicated ablations isolating the image-text data contribution would strengthen the central claim. In the revised manuscript we have added a new ablation subsection that trains an otherwise identical model on video-text data only and compares it directly to the joint image-video model on multi-prompt coherence, long-horizon consistency, and open-domain adherence. The results are reported both qualitatively and with newly introduced proxy quantitative measures; they support the value of the image-text component. We have also tempered the 'first study' claim by adding citations to concurrent related work and clarifying the specific contribution of variable-length story generation. revision: yes

  2. Referee: [Abstract and Results] Abstract and Results: No quantitative metrics (e.g., FVD, FID-video, or controlled user studies), error bars, or statistical comparisons to prior video generation methods or per-frame baselines are reported. The claimed improvements in spatio-temporal consistency and open-domain performance therefore rest solely on qualitative examples, limiting assessment of the central claims.

    Authors: We acknowledge that the original submission relied primarily on qualitative examples. The revised manuscript now includes quantitative evaluations: Fréchet Video Distance (FVD) and video FID computed on held-out test sets, direct comparisons against prior text-to-video methods and per-frame baselines, error bars obtained from multiple random seeds, and a controlled user study with 50 raters assessing spatio-temporal consistency and realism. These additions appear in an expanded Results section and are summarized in the abstract. revision: yes

Circularity Check

0 steps flagged

No circularity: claims rest on new architecture and empirical training strategy, not reductions to self-defined quantities or fitted inputs.

full rationale

The paper introduces a causal-attention tokenizer for variable-length video tokenization and a bidirectional masked transformer for text-conditioned generation. The joint image-text plus video-text training is asserted as an empirical solution to data limits, with no equations or derivations shown that reduce the generalization claim to parameters fitted from the same data or to prior self-citations. The 'first study of time-variable prompts' is a novelty statement, not a mathematical result derived from the model's own definitions. No load-bearing self-citation chains, ansatzes smuggled via citation, or renaming of known results appear in the provided text. The derivation is self-contained via architectural choices and training procedure.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 1 invented entity

The central claims depend on the effectiveness of the causal tokenizer and the joint image-video training strategy, both of which rest on untested architectural and data assumptions rather than external benchmarks or formal derivations.

free parameters (2)
  • discrete token vocabulary size
    Hyperparameter chosen to balance compression rate against reconstruction quality for the video tokenizer.
  • transformer layer count and attention heads
    Architectural scale parameters selected for the bidirectional masked generator.
axioms (2)
  • domain assumption Causal attention over time allows the tokenizer to process videos of arbitrary length without retraining or padding
    Invoked to justify variable-length capability in the video representation model.
  • domain assumption Joint training on image-text pairs plus limited video-text data produces open-domain generalization superior to video-only training
    Used to address the scarcity of high-quality text-video data.
invented entities (1)
  • Causal video tokenizer (no independent evidence)
    purpose: Compresses variable-length videos into discrete tokens while preserving spatio-temporal structure
    New component introduced to handle the variable-length requirement.

pith-pipeline@v0.9.0 · 5560 in / 1581 out tokens · 86460 ms · 2026-05-17T01:37:51.649559+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 17 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. MusicLM: Generating Music From Text

    cs.SD 2023-01 conditional novelty 8.0

    MusicLM produces coherent multi-minute 24 kHz music from text prompts using hierarchical sequence-to-sequence modeling and outperforms prior systems in quality and text adherence.

  2. TeDiO: Temporal Diagonal Optimization for Training-Free Coherent Video Diffusion

    cs.CV 2026-05 unverdicted novelty 7.0

    TeDiO regularizes temporal diagonals in diffusion transformer attention maps to produce smoother video motion while keeping per-frame quality intact.

  3. DCR: Counterfactual Attractor Guidance for Rare Compositional Generation

    cs.CV 2026-05 unverdicted novelty 7.0

    DCR uses a counterfactual attractor and projection-based repulsion to suppress default completion bias in diffusion models, improving fidelity for rare compositional prompts while preserving quality.

  4. ViVa: A Video-Generative Value Model for Robot Reinforcement Learning

    cs.RO 2026-04 unverdicted novelty 7.0

    ViVa turns a video generator into a value model for robot RL that jointly forecasts future states and task value, yielding better performance on real-world box assembly when integrated with RECAP.

  5. Learning Interactive Real-World Simulators

    cs.AI 2023-10 conditional novelty 7.0

    UniSim learns a universal real-world simulator from orchestrated diverse datasets, enabling zero-shot deployment of policies trained purely in simulation.

  6. Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation

    cs.CV 2023-10 unverdicted novelty 7.0

    A new shared video-image tokenizer enables large language models to surpass diffusion models on standard visual generation benchmarks.

  7. Head Forcing: Long Autoregressive Video Generation via Head Heterogeneity

    cs.CV 2026-05 unverdicted novelty 6.0

    Head Forcing assigns tailored KV cache strategies to local, anchor, and memory attention heads plus head-wise RoPE re-encoding to extend autoregressive video generation from seconds to minutes without training.

  8. SWIFT: Prompt-Adaptive Memory for Efficient Interactive Long Video Generation

    cs.CV 2026-05 unverdicted novelty 6.0

    SWIFT introduces a semantic injection cache with head-wise updates and an adaptive dynamic window plus segment anchors to achieve efficient multi-prompt long video generation at 22.6 FPS while preserving quality in ca...

  9. Rolling Sink: Bridging Limited-Horizon Training and Open-Ended Testing in Autoregressive Video Diffusion

    cs.CV 2026-02 unverdicted novelty 6.0

    Rolling Sink is a training-free cache adjustment technique that maintains visual consistency in autoregressive video diffusion models for ultra-long open-ended generation beyond training horizons.

  10. Reward Forcing: Efficient Streaming Video Generation with Rewarded Distribution Matching Distillation

    cs.CV 2025-12 conditional novelty 6.0

    Reward Forcing combines EMA-Sink tokens and Rewarded Distribution Matching Distillation to deliver state-of-the-art streaming video generation at 23.1 FPS without copying initial frames.

  11. VideoPoet: A Large Language Model for Zero-Shot Video Generation

    cs.CV 2023-12 unverdicted novelty 6.0

    VideoPoet is a large language model that performs zero-shot video generation with audio from diverse multimodal conditioning signals.

  12. Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

    cs.CV 2023-11 conditional novelty 6.0

    Stable Video Diffusion scales latent video diffusion models via text-to-image pretraining, video pretraining on curated data, and high-quality finetuning to produce competitive text-to-video and image-to-video results...

  13. Diffusion-APO: Trajectory-Aware Direct Preference Alignment for Video Diffusion Transformers

    cs.CV 2026-05 unverdicted novelty 5.0

    Diffusion-APO synchronizes training noise with inference trajectories in video diffusion models to improve preference alignment and visual quality.

  14. ST-Gen4D: Embedding 4D Spatiotemporal Cognition into World Model for 4D Generation

    cs.CV 2026-05 unverdicted novelty 5.0

    ST-Gen4D uses a world model that fuses global appearance and local dynamic graphs into a 4D cognition representation to guide consistent 4D Gaussian generation.

  15. Ride the Wave: Precision-Allocated Sparse Attention for Smooth Video Generation

    cs.CV 2026-04 unverdicted novelty 5.0

    PASA uses curvature-aware dynamic budgeting, grouped approximations, and stochastic attention routing to accelerate video diffusion transformers while eliminating temporal flickering from sparse patterns.

  16. World Model on Million-Length Video And Language With Blockwise RingAttention

    cs.LG 2024-02 unverdicted novelty 5.0

    Presents open-source 7B models for million-token video and language understanding via Blockwise RingAttention, setting new benchmarks in retrieval and long video tasks.

  17. Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models

    cs.CV 2024-02 unverdicted novelty 2.0

    The paper reviews the background, technology, applications, limitations, and future directions of OpenAI's Sora text-to-video generative model based on public information.

Reference graph

Works this paper leans on

61 extracted references · 61 canonical work pages · cited by 17 Pith papers · 12 internal anchors

  1. [1]

    Vivit: A video vision transformer

    Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lucic, and Cordelia Schmid. Vivit: A video vision transformer. In ICCV, 2021

  2. [2]

    Stochastic variational video prediction

    Mohammad Babaeizadeh, Chelsea Finn, Dumitru Erhan, Roy H Campbell, and Sergey Levine. Stochastic variational video prediction. ICLR, 2018

  3. [3]

    Fitvid: Overfitting in pixel-level video prediction

    Mohammad Babaeizadeh, Mohammad Taghi Saffar, Suraj Nair, Sergey Levine, Chelsea Finn, and Dumitru Erhan. Fitvid: Overfitting in pixel-level video prediction. arXiv preprint arXiv:2106.13195, 2021

  4. [4]

    Frozen in time: A joint video and image encoder for end-to-end retrieval

    Max Bain, Arsha Nagrani, Gül Varol, and Andrew Zisserman. Frozen in time: A joint video and image encoder for end-to-end retrieval. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1728–1738, 2021

  5. [5]

    Conditional gan with discriminative filter generation for text-to-video synthesis

    Yogesh Balaji, Martin Renqiang Min, Bing Bai, Rama Chellappa, and Hans Peter Graf. Conditional gan with discriminative filter generation for text-to-video synthesis. In IJCAI, 2019

  6. [6]

    Quo vadis, action recognition? a new model and the kinetics dataset

    Joao Carreira and Andrew Zisserman. Quo vadis, action recognition? a new model and the kinetics dataset. In CVPR, 2017

  7. [7]

    A short note about kinetics-600, 2018

    Joao Carreira, Eric Noland, Andras Banki-Horvath, Chloe Hillier, and Andrew Zisserman. A short note about kinetics-600, 2018

  8. [8]

    Maskgit: Masked generative image transformer

    Huiwen Chang, Han Zhang, Lu Jiang, Ce Liu, and William T. Freeman. Maskgit: Masked generative image transformer. arXiv preprint arXiv:2202.04200, 2022

  9. [9]

    Adversarial video generation on complex datasets

    Aidan Clark, Jeff Donahue, and Karen Simonyan. Adversarial video generation on complex datasets. arXiv preprint arXiv:1907.06571, 2019

  10. [10]

    Stochastic video generation with a learned prior

    Emily Denton and Rob Fergus. Stochastic video generation with a learned prior. In Jennifer Dy and Andreas Krause, editors,Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 1174–1183, 2018

  11. [11]

    Self-supervised visual planning with temporal skip connections

    Frederik Ebert, Chelsea Finn, Alex X. Lee, and Sergey Levine. Self-supervised visual planning with temporal skip connections, 2017

  12. [12]

    Taming transformers for high-resolution image synthesis, 2020

    Patrick Esser, Robin Rombach, and Björn Ommer. Taming transformers for high-resolution image synthesis, 2020

  13. [13]

    Unsupervised learning for physical interaction through video prediction

    Chelsea Finn, Ian Goodfellow, and Sergey Levine. Unsupervised learning for physical interaction through video prediction. In Advances in neural information processing systems, pages 64–72, 2016

  14. [14]

    Flexible diffusion modeling of long videos

    William Harvey, Saeid Naderiparizi, Vaden Masrani, Christian Weilbach, and Frank Wood. Flexible diffusion modeling of long videos. arXiv preprint arXiv:2205.11495, 2022

  15. [15]

    Gans trained by a two time-scale update rule converge to a local nash equilibrium

    Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems , 30, 2017

  16. [16]

    Classifier-free diffusion guidance, 2021

    Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance, 2021

  17. [17]

    Video Diffusion Models

    Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J Fleet. Video diffusion models. arXiv preprint arXiv:2204.03458, 2022

  18. [18]

    CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers

    Wenyi Hong, Ming Ding, Wendi Zheng, Xinghan Liu, and Jie Tang. Cogvideo: Large-scale pretraining for text-to-video generation via transformers. arXiv preprint arXiv:2205.15868 , 2022

  19. [19]

    Diffusion models for video prediction and infilling

    Tobias Höppe, Arash Mehrjou, Stefan Bauer, Didrik Nielsen, and Andrea Dittadi. Diffusion models for video prediction and infilling. arXiv preprint arXiv:2206.07696, 2022

  20. [20]

    Perceptual Losses for Real-Time Style Transfer and Super-Resolution

    Justin Johnson, Alexandre Alahi, and Li Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. arXiv preprint arXiv:1603.08155, 2016

  21. [21]

    Analyzing and improving the image quality of stylegan

    Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Analyzing and improving the image quality of stylegan. In CVPR, 2020

  22. [22]

    The kinetics human action video dataset, 2017

    Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, Mustafa Suleyman, and Andrew Zisserman. The kinetics human action video dataset, 2017

  23. [23]

    Adam: A method for stochastic optimization

    Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015

  24. [24]

    Videoflow: A flow-based generative model for video

    Manoj Kumar, Mohammad Babaeizadeh, Dumitru Erhan, Chelsea Finn, Sergey Levine, Laurent Dinh, and Durk Kingma. Videoflow: A flow-based generative model for video. arXiv preprint arXiv:1903.01434, 2019

  25. [25]

    Video generation from text

    Yitong Li, Martin Min, Dinghan Shen, David Carlson, and Lawrence Carin. Video generation from text. In AAAI, 2018

  26. [26]

    Decoupled weight decay regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In ICLR, 2019

  27. [27]

    Transformation-based adversarial video prediction on large-scale data

    Pauline Luc, Aidan Clark, Sander Dieleman, Diego de Las Casas, Yotam Doron, Albin Cassirer, and Karen Simonyan. Transformation-based adversarial video prediction on large-scale data. arXiv preprint arXiv:2003.04035, 2020

  28. [28]

    CCVS: Context-aware controllable video synthesis

    Guillaume Le Moing, Jean Ponce, and Cordelia Schmid. CCVS: Context-aware controllable video synthesis. In NeurIPS, 2021

  29. [29]

    Moments in time dataset: one million videos for event understanding

    Mathew Monfort, Alex Andonian, Bolei Zhou, Kandan Ramakrishnan, Sarah Adel Bargal, Tom Yan, Lisa Brown, Quanfu Fan, Dan Gutfreund, Carl Vondrick, et al. Moments in time dataset: one million videos for event understanding. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019

  30. [30]

    Learning audio-video modalities from image captions

    Arsha Nagrani, Paul Hongsuck Seo, Bryan Andrew Seybold, Anja Hauth, Santiago Manen, Chen Sun, and Cordelia Schmid. Learning audio-video modalities from image captions. In ECCV, 2022

  31. [31]

    Transframer: Arbitrary frame prediction with generative models

    Charlie Nash, João Carreira, Jacob Walker, Iain Barr, Andrew Jaegle, Mateusz Malinowski, and Peter Battaglia. Transframer: Arbitrary frame prediction with generative models. arXiv preprint arXiv:2203.09494, 2022

  32. [32]

    GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models

    Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741, 2021

  33. [33]

    Latent video transformer

    Ruslan Rakhimov, Denis Volkhonskiy, Alexey Artemov, Denis Zorin, and Evgeny Burnaev. Latent video transformer. arXiv preprint arXiv:2006.10704, 2020

  34. [34]

    Zero-shot text-to-image generation

    Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. In International Conference on Machine Learning, pages 8821–8831. PMLR, 2021

  35. [35]

    Hierarchical Text-Conditional Image Generation with CLIP Latents

    Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 2022

  36. [36]

    Video (language) modeling: a baseline for generative models of natural videos

    Marc'Aurelio Ranzato, Arthur Szlam, Joan Bruna, Michael Mathieu, Ronan Collobert, and Sumit Chopra. Video (language) modeling: a baseline for generative models of natural videos. arXiv preprint arXiv:1412.6604, 2014

  37. [37]

    Scaling up models and data with t5x and seqio

    Adam Roberts, Hyung Won Chung, Anselm Levskaya, Gaurav Mishra, James Bradbury, Daniel Andor, Sharan Narang, Brian Lester, Colin Gaffney, Afroz Mohiuddin, et al. Scaling up models and data with t5x and seqio. arXiv preprint arXiv:2203.17189, 2022

  38. [38]

    Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding

    Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S Sara Mahdavi, Rapha Gontijo Lopes, et al. Photorealistic text-to-image diffusion models with deep language understanding. arXiv preprint arXiv:2205.11487, 2022

  39. [39]

    Temporal generative adversarial nets with singular value clipping

    Masaki Saito, Eiichi Matsumoto, and Shunta Saito. Temporal generative adversarial nets with singular value clipping. In Proceedings of the IEEE international conference on computer vision, pages 2830–2839, 2017

  40. [40]

    Train sparsely, generate densely: Memory-efficient unsupervised training of high-resolution temporal gan

    Masaki Saito, Shunta Saito, Masanori Koyama, and Sosuke Kobayashi. Train sparsely, generate densely: Memory-efficient unsupervised training of high-resolution temporal gan. International Journal of Computer Vision, 128(10):2586–2606, 2020

  41. [41]

    LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs

    Christoph Schuhmann, Richard Vencu, Romain Beaumont, Robert Kaczmarczyk, Clayton Mullis, Aarush Katta, Theo Coombes, Jenia Jitsev, and Aran Komatsuzaki. Laion-400m: Open dataset of clip-filtered 400 million image-text pairs. arXiv preprint arXiv:2111.02114, 2021

  42. [42]

    Unsupervised learning of video representations using lstms

    Nitish Srivastava, Elman Mansimov, and Ruslan Salakhudinov. Unsupervised learning of video representations using lstms. In International Conference on Machine Learning , 2015

  43. [43]

    Mocogan: Decomposing motion and content for video generation

    Sergey Tulyakov, Ming-Yu Liu, Xiaodong Yang, and Jan Kautz. Mocogan: Decomposing motion and content for video generation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1526–1535, 2018

  44. [44]

    Towards Accurate Generative Models of Video: A New Metric & Challenges

    Thomas Unterthiner, Sjoerd van Steenkiste, Karol Kurach, Raphael Marinier, Marcin Michalski, and Sylvain Gelly. Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:1812.01717, 2018

  45. [45]

    Neural discrete representation learning

    Aaron van den Oord, Oriol Vinyals, and Koray Kavukcuoglu. Neural discrete representation learning. In NeurIPS, 2018

  46. [46]

    High fidelity video prediction with large stochastic recurrent neural networks

    Ruben Villegas, Arkanath Pathak, Harini Kannan, Dumitru Erhan, Quoc V Le, and Honglak Lee. High fidelity video prediction with large stochastic recurrent neural networks. In Advances in Neural Information Processing Systems, pages 81–91, 2019

  47. [47]

    Mcvd: Masked conditional video diffusion for prediction, generation, and interpolation

    Vikram Voleti, Alexia Jolicoeur-Martineau, and Christopher Pal. Mcvd: Masked conditional video diffusion for prediction, generation, and interpolation. arXiv preprint arXiv:2205.09853, 2022

  48. [48]

    Generating Videos with Scene Dynamics

    Carl Vondrick, Hamed Pirsiavash, and Antonio Torralba. Generating videos with scene dynamics. arXiv preprint arXiv:1609.02612, 2016

  49. [49]

    Predicting video with vqvae

    Jacob Walker, Ali Razavi, and Aäron van den Oord. Predicting video with vqvae. arXiv preprint arXiv:2103.01950, 2021

  50. [50]

    Predrnn: Recurrent neural networks for predictive learning using spatiotemporal lstms

    Yunbo Wang, Mingsheng Long, Jianmin Wang, Zhifeng Gao, and Philip S Yu. Predrnn: Recurrent neural networks for predictive learning using spatiotemporal lstms. Advances in neural information processing systems, 30, 2017

  51. [51]

    Scaling autoregressive video models

    Dirk Weissenborn, Oscar Täckström, and Jakob Uszkoreit. Scaling autoregressive video models. In ICLR, 2020

  52. [52]

    Godiva: Generating open-domain videos from natural descriptions

    Chenfei Wu, Lun Huang, Qianxi Zhang, Binyang Li, Lei Ji, Fan Yang, Guillermo Sapiro, and Nan Duan. Godiva: Generating open-domain videos from natural descriptions. arXiv preprint arXiv:2104.14806, 2021

  53. [53]

    Nuwa-infinity: Autoregressive over autoregressive generation for infinite visual synthesis

    Chenfei Wu, Jian Liang, Xiaowei Hu, Zhe Gan, Jianfeng Wang, Lijuan Wang, Zicheng Liu, Yuejian Fang, and Nan Duan. Nuwa-infinity: Autoregressive over autoregressive generation for infinite visual synthesis. arXiv preprint arXiv:2207.09814, 2022

  54. [54]

    NÜWA: Visual synthesis pre-training for neural visual world creation

    Chenfei Wu, Jian Liang, Lei Ji, Fan Yang, Yuejian Fang, Daxin Jiang, and Nan Duan. NÜWA: Visual synthesis pre-training for neural visual world creation. In ECCV, 2022

  55. [55]

    VideoGPT: Video Generation using VQ-VAE and Transformers

    Wilson Yan, Yunzhi Zhang, Pieter Abbeel, and Aravind Srinivas. Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:2104.10157, 2021

  56. [56]

    Diffusion probabilistic modeling for video generation

    Ruihan Yang, Prakhar Srivastava, and Stephan Mandt. Diffusion probabilistic modeling for video generation. arXiv preprint arXiv:2203.09481, 2022

  57. [57]

    Harp: Autoregressive latent video prediction with high-fidelity image generator

    Younggyo Seo, Kimin Lee, Fangchen Liu, Stephen James, and Pieter Abbeel. Harp: Autoregressive latent video prediction with high-fidelity image generator. arXiv preprint arXiv:2209.07143, 2022

  58. [58]

    Vector-quantized image modeling with improved vqgan

    Jiahui Yu, Xin Li, Jing Yu Koh, Han Zhang, Ruoming Pang, James Qin, Alexander Ku, Yuanzhong Xu, Jason Baldridge, and Yonghui Wu. Vector-quantized image modeling with improved vqgan. In ICLR, 2022

  59. [59]

    Scaling Autoregressive Models for Content-Rich Text-to-Image Generation

    Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gunjan Baid, Zirui Wang, Vijay Vasudevan, Alexander Ku, Yinfei Yang, Burcu Karagol Ayan, Ben Hutchinson, Wei Han, Zarana Parekh, Xin Li, Han Zhang, Jason Baldridge, and Yonghui Wu. Scaling autoregressive models for content-rich text-to-image generation. arXiv preprint arXiv:2206.10789, 2022

  60. [60]

    Scaling vision transformers

    Xiaohua Zhai, Alexander Kolesnikov, Neil Houlsby, and Lucas Beyer. Scaling vision transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12104–12113, 2022

  61. [61]

    The unreasonable effectiveness of deep features as a perceptual metric

    Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. CVPR, 2018