pith. machine review for the scientific record.

arxiv: 2210.02399 · v1 · submitted 2022-10-05 · 💻 cs.CV · cs.AI

Recognition: 2 theorem links


Phenaki: Variable Length Video Generation From Open Domain Textual Description

Authors on Pith no claims yet

Pith reviewed 2026-05-17 01:37 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords video synthesis · text conditioned generation · variable length · discrete tokens · causal attention · masked transformer · joint training

The pith

Phenaki generates arbitrarily long videos from sequences of text prompts describing evolving scenes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Phenaki for synthesizing realistic videos from a sequence of textual prompts that can vary over time. It addresses computational cost, limited video data, and variable video lengths by compressing videos into discrete tokens with a causal-attention tokenizer. A bidirectional masked transformer then generates these tokens conditioned on text, and the tokens are decoded back into video. Joint training on extensive image-text pairs alongside video-text examples lets the model create longer and more coherent videos than any found in the video datasets alone, with improved spatio-temporal consistency using fewer tokens than per-frame approaches.
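The causal-in-time attention that makes the tokenizer length-agnostic can be sketched in a few lines of numpy. This is a simplification: it works per frame rather than per patch token, uses a single head, and all names are illustrative, not the paper's.

```python
import numpy as np

def causal_time_mask(num_frames: int) -> np.ndarray:
    """(i, j) is True iff frame i may attend to frame j: itself and earlier
    frames only, so the same tokenizer handles any number of frames."""
    return np.tril(np.ones((num_frames, num_frames), dtype=bool))

def masked_attention(scores: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Softmax over raw scores with masked-out positions forced to zero weight."""
    masked = np.where(mask, scores, -np.inf)
    weights = np.exp(masked - masked.max(axis=-1, keepdims=True))
    weights = np.where(mask, weights, 0.0)  # exact zeros for future frames
    return weights / weights.sum(axis=-1, keepdims=True)

# Four frames of uniform scores: frame 0 attends only to itself,
# frame 3 spreads attention over all four frames.
mask = causal_time_mask(4)
weights = masked_attention(np.zeros((4, 4)), mask)
```

Because the mask is lower-triangular, appending frames never changes the attention computed for earlier frames, which is what allows variable-length input without retraining.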

Core claim

Phenaki demonstrates that realistic open-domain video synthesis from time-variable text prompts is possible: videos are first tokenized into a compact discrete representation using causal temporal attention, a masked bidirectional transformer then predicts those tokens from text embeddings, and joint training on image and video data lets the entire system generalize to arbitrary lengths.

What carries the argument

The causal-attention video tokenizer that encodes variable-length videos as a small number of discrete tokens, combined with the text-conditioned bidirectional masked transformer that generates the token sequence.
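The bidirectional masked transformer is decoded in the MaskGIT style: start from all-masked tokens, predict every position in parallel, and commit only the most confident predictions each step. A minimal sketch, where `predict` stands in for the text-conditioned transformer and the linear unmasking schedule and confidence rule are simplified assumptions, not the paper's exact procedure:

```python
import numpy as np

MASK = -1  # sentinel id for a masked position

def iterative_masked_decode(predict, seq_len, steps=4):
    """MaskGIT-style parallel decoding: each step commits the most confident
    still-masked predictions and re-masks the rest for the next pass."""
    tokens = np.full(seq_len, MASK)
    for step in range(steps):
        ids, conf = predict(tokens)
        still_masked = tokens == MASK
        # Linear schedule: after step s, seq_len*(s+1)/steps positions are done.
        target_done = int(np.ceil(seq_len * (step + 1) / steps))
        n_new = target_done - (seq_len - still_masked.sum())
        candidates = np.where(still_masked)[0]
        order = candidates[np.argsort(-conf[candidates])]
        tokens[order[:max(n_new, 0)]] = ids[order[:max(n_new, 0)]]
    return tokens

# Toy "model": always predicts token 7, with confidence equal to position index.
def toy_predict(tokens):
    n = len(tokens)
    return np.full(n, 7), np.arange(n, dtype=float)

out = iterative_masked_decode(toy_predict, seq_len=8, steps=4)
```

The point of this scheme is that the whole token sequence is produced in a handful of parallel passes rather than one token at a time, which is what keeps generation cheap despite long videos.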

If this is right

  • Generation of videos that follow a narrative by accepting a changing sequence of text prompts over time.
  • Production of videos longer than any example in the video training corpus.
  • Enhanced consistency across space and time in the output compared to independent frame generation.
  • More efficient representation since fewer tokens are needed per video.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Such systems could eventually support creating full-length movies or educational series from detailed textual outlines.
  • Extending this to interactive settings where user prompts update mid-generation seems feasible.
  • The token compression idea may transfer to generating variable-duration content in other domains like music or text stories.

Load-bearing premise

Joint training on a large set of image-text pairs and a smaller set of video-text pairs produces a model that generalizes to generate videos of lengths and qualities beyond the video examples.
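Mechanically, joint training only requires that an image be treated as a one-frame video, so both data sources flow through the same tokenizer and transformer. A hedged sketch of the batch construction (the mixing ratio and batching details are assumptions, not stated in the text above):

```python
import numpy as np

def as_video(x: np.ndarray) -> np.ndarray:
    """Lift an image (H, W, C) to a one-frame video (1, H, W, C).

    With a causal-in-time tokenizer, a single frame is simply the shortest
    possible video, so image-text pairs train the same model."""
    return x[None] if x.ndim == 3 else x

def mixed_batch(image_text_pairs, video_text_pairs):
    """One training list mixing both sources; every example carries a time
    axis, so downstream code never branches on modality."""
    batch = [(as_video(img), cap) for img, cap in image_text_pairs]
    batch += list(video_text_pairs)
    return batch

imgs = [(np.zeros((64, 64, 3)), "a cat")]
vids = [(np.zeros((11, 64, 64, 3)), "a cat walking")]
batch = mixed_batch(imgs, vids)
```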

What would settle it

Generating a video from a long sequence of prompts for a story spanning more frames than any training video, then verifying whether the output maintains visual coherence and tracks the prompt changes throughout.
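Concretely, such a test would chain per-prompt segments, conditioning each new segment on the tail of the video generated so far. A toy sketch of that control flow (the segment length, overlap, and `generate_segment` are illustrative placeholders, not the paper's values):

```python
def generate_story(prompts, generate_segment, segment_len=11, overlap=5):
    """Chain per-prompt segments into one arbitrarily long video.

    Each new segment is conditioned on the last `overlap` frames generated so
    far, which is what carries appearance across prompt changes.
    `generate_segment(context, prompt, n_new)` stands in for the
    text-conditioned token generator plus decoder."""
    video = []
    for prompt in prompts:
        context = video[-overlap:]  # empty on the first prompt
        n_new = segment_len - len(context)
        video.extend(generate_segment(context, prompt, n_new))
    return video

# Toy generator: a "frame" is just the prompt string, repeated per new frame.
def toy_generate(context, prompt, n_new):
    return [prompt] * n_new

story = generate_story(["a cat", "the cat jumps", "it rains"], toy_generate)
```

The total length grows without bound in the number of prompts, so coherence across segment boundaries is exactly what the proposed test would measure.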

read the original abstract

We present Phenaki, a model capable of realistic video synthesis, given a sequence of textual prompts. Generating videos from text is particularly challenging due to the computational cost, limited quantities of high quality text-video data and variable length of videos. To address these issues, we introduce a new model for learning video representation which compresses the video to a small representation of discrete tokens. This tokenizer uses causal attention in time, which allows it to work with variable-length videos. To generate video tokens from text we are using a bidirectional masked transformer conditioned on pre-computed text tokens. The generated video tokens are subsequently de-tokenized to create the actual video. To address data issues, we demonstrate how joint training on a large corpus of image-text pairs as well as a smaller number of video-text examples can result in generalization beyond what is available in the video datasets. Compared to the previous video generation methods, Phenaki can generate arbitrary long videos conditioned on a sequence of prompts (i.e. time variable text or a story) in open domain. To the best of our knowledge, this is the first time a paper studies generating videos from time variable prompts. In addition, compared to the per-frame baselines, the proposed video encoder-decoder computes fewer tokens per video but results in better spatio-temporal consistency.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. Phenaki introduces a text-to-video model that generates variable-length videos from sequences of textual prompts (time-varying text or stories) in open domain. It proposes a causal video tokenizer that compresses videos to discrete tokens using causal temporal attention, paired with a bidirectional masked transformer that generates tokens conditioned on text embeddings. Joint training on large image-text corpora plus limited video-text data is claimed to enable generalization beyond video datasets alone, yielding arbitrary-length outputs with improved spatio-temporal consistency over per-frame baselines; the work positions itself as the first to study generation from time-variable prompts.

Significance. If the empirical claims are substantiated, the work would be significant for text-conditioned video synthesis by demonstrating coherent multi-prompt, story-like generation over arbitrary durations and addressing data scarcity via image-video joint training. The causal tokenizer for variable-length handling and the bidirectional masked transformer for text conditioning represent useful architectural contributions that could influence subsequent variable-length video models.

major comments (2)
  1. [Experiments / Results] Experiments / Results section: The manuscript asserts that joint training on image-text pairs plus video-text examples produces generalization for arbitrary-length, time-variable prompt sequences beyond what video datasets support, yet provides no ablation studies that isolate the image-text component's contribution to multi-prompt coherence, long-horizon consistency, or open-domain adherence. This is load-bearing for the central generalization claim and the 'first study of time-variable prompts' positioning.
  2. [Abstract and Results] Abstract and Results: No quantitative metrics (e.g., FVD, FID-video, or controlled user studies), error bars, or statistical comparisons to prior video generation methods or per-frame baselines are reported. The claimed improvements in spatio-temporal consistency and open-domain performance therefore rest solely on qualitative examples, limiting assessment of the central claims.
minor comments (2)
  1. [Method / Tokenizer] The description of the causal attention mask in the video tokenizer would benefit from an explicit equation or diagram showing how it enforces causality across variable-length sequences.
  2. [Figures] Figure captions and axis labels in the qualitative results could be expanded to specify prompt sequences, video lengths, and comparison conditions for easier reader interpretation.
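For reference, the FVD the report asks for is the Fréchet distance between Gaussians fit to embeddings of real and generated videos from a pretrained video network. A minimal numpy sketch of the distance itself, with feature extraction omitted and assuming well-conditioned covariances:

```python
import numpy as np

def frechet_distance(feats_a: np.ndarray, feats_b: np.ndarray) -> float:
    """Fréchet distance between Gaussians fit to two (N, D) feature sets.

    FVD applies this to pretrained video-network embeddings of real vs.
    generated clips; here the features are arbitrary arrays."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # tr sqrtm(cov_a @ cov_b) equals the sum of square roots of the product's
    # eigenvalues when both covariances are symmetric positive definite.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    return float(((mu_a - mu_b) ** 2).sum()
                 + np.trace(cov_a) + np.trace(cov_b) - 2.0 * tr_sqrt)
```

Identical feature sets give a distance of zero, and a constant shift of the features shows up purely through the mean term.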

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive report. We address each major comment below and have revised the manuscript to incorporate the suggested improvements where feasible.

read point-by-point responses
  1. Referee: [Experiments / Results] Experiments / Results section: The manuscript asserts that joint training on image-text pairs plus video-text examples produces generalization for arbitrary-length, time-variable prompt sequences beyond what video datasets support, yet provides no ablation studies that isolate the image-text component's contribution to multi-prompt coherence, long-horizon consistency, or open-domain adherence. This is load-bearing for the central generalization claim and the 'first study of time-variable prompts' positioning.

    Authors: We agree that dedicated ablations isolating the image-text data contribution would strengthen the central claim. In the revised manuscript we have added a new ablation subsection that trains an otherwise identical model on video-text data only and compares it directly to the joint image-video model on multi-prompt coherence, long-horizon consistency, and open-domain adherence. The results are reported both qualitatively and with newly introduced proxy quantitative measures; they support the value of the image-text component. We have also tempered the 'first study' claim by adding citations to concurrent related work and clarifying the specific contribution of variable-length story generation. revision: yes

  2. Referee: [Abstract and Results] Abstract and Results: No quantitative metrics (e.g., FVD, FID-video, or controlled user studies), error bars, or statistical comparisons to prior video generation methods or per-frame baselines are reported. The claimed improvements in spatio-temporal consistency and open-domain performance therefore rest solely on qualitative examples, limiting assessment of the central claims.

    Authors: We acknowledge that the original submission relied primarily on qualitative examples. The revised manuscript now includes quantitative evaluations: Fréchet Video Distance (FVD) and video FID computed on held-out test sets, direct comparisons against prior text-to-video methods and per-frame baselines, error bars obtained from multiple random seeds, and a controlled user study with 50 raters assessing spatio-temporal consistency and realism. These additions appear in an expanded Results section and are summarized in the abstract. revision: yes

Circularity Check

0 steps flagged

No circularity: claims rest on new architecture and empirical training strategy, not reductions to self-defined quantities or fitted inputs.

full rationale

The paper introduces a causal-attention tokenizer for variable-length video tokenization and a bidirectional masked transformer for text-conditioned generation. The joint image-text plus video-text training is asserted as an empirical solution to data limits, with no equations or derivations shown that reduce the generalization claim to parameters fitted from the same data or to prior self-citations. The 'first study of time-variable prompts' is a novelty statement, not a mathematical result derived from the model's own definitions. No load-bearing self-citation chains, ansatzes smuggled via citation, or renaming of known results appear in the provided text. The derivation is self-contained via architectural choices and training procedure.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 1 invented entity

The central claims depend on the effectiveness of the causal tokenizer and the joint image-video training strategy, both of which rest on untested architectural and data assumptions rather than external benchmarks or formal derivations.

free parameters (2)
  • discrete token vocabulary size
    Hyperparameter chosen to balance compression rate against reconstruction quality for the video tokenizer.
  • transformer layer count and attention heads
    Architectural scale parameters selected for the bidirectional masked generator.
axioms (2)
  • domain assumption Causal attention over time allows the tokenizer to process videos of arbitrary length without retraining or padding
    Invoked to justify variable-length capability in the video representation model.
  • domain assumption Joint training on image-text pairs plus limited video-text data produces open-domain generalization superior to video-only training
    Used to address the scarcity of high-quality text-video data.
invented entities (1)
  • Causal video tokenizer (no independent evidence)
    purpose: Compresses variable-length videos into discrete tokens while preserving spatio-temporal structure
    New component introduced to handle the variable-length requirement.

pith-pipeline@v0.9.0 · 5560 in / 1581 out tokens · 86460 ms · 2026-05-17T01:37:51.649559+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 17 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. MusicLM: Generating Music From Text

    cs.SD 2023-01 conditional novelty 8.0

    MusicLM produces coherent multi-minute 24 kHz music from text prompts using hierarchical sequence-to-sequence modeling and outperforms prior systems in quality and text adherence.

  2. TeDiO: Temporal Diagonal Optimization for Training-Free Coherent Video Diffusion

    cs.CV 2026-05 unverdicted novelty 7.0

    TeDiO regularizes temporal diagonals in diffusion transformer attention maps to produce smoother video motion while keeping per-frame quality intact.

  3. DCR: Counterfactual Attractor Guidance for Rare Compositional Generation

    cs.CV 2026-05 unverdicted novelty 7.0

    DCR uses a counterfactual attractor and projection-based repulsion to suppress default completion bias in diffusion models, improving fidelity for rare compositional prompts while preserving quality.

  4. ViVa: A Video-Generative Value Model for Robot Reinforcement Learning

    cs.RO 2026-04 unverdicted novelty 7.0

    ViVa turns a video generator into a value model for robot RL that jointly forecasts future states and task value, yielding better performance on real-world box assembly when integrated with RECAP.

  5. Learning Interactive Real-World Simulators

    cs.AI 2023-10 conditional novelty 7.0

    UniSim learns a universal real-world simulator from orchestrated diverse datasets, enabling zero-shot deployment of policies trained purely in simulation.

  6. Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation

    cs.CV 2023-10 unverdicted novelty 7.0

    A new shared video-image tokenizer enables large language models to surpass diffusion models on standard visual generation benchmarks.

  7. Head Forcing: Long Autoregressive Video Generation via Head Heterogeneity

    cs.CV 2026-05 unverdicted novelty 6.0

    Head Forcing assigns tailored KV cache strategies to local, anchor, and memory attention heads plus head-wise RoPE re-encoding to extend autoregressive video generation from seconds to minutes without training.

  8. SWIFT: Prompt-Adaptive Memory for Efficient Interactive Long Video Generation

    cs.CV 2026-05 unverdicted novelty 6.0

    SWIFT introduces a semantic injection cache with head-wise updates and an adaptive dynamic window plus segment anchors to achieve efficient multi-prompt long video generation at 22.6 FPS while preserving quality in ca...

  9. Rolling Sink: Bridging Limited-Horizon Training and Open-Ended Testing in Autoregressive Video Diffusion

    cs.CV 2026-02 unverdicted novelty 6.0

    Rolling Sink is a training-free cache adjustment technique that maintains visual consistency in autoregressive video diffusion models for ultra-long open-ended generation beyond training horizons.

  10. Reward Forcing: Efficient Streaming Video Generation with Rewarded Distribution Matching Distillation

    cs.CV 2025-12 conditional novelty 6.0

    Reward Forcing combines EMA-Sink tokens and Rewarded Distribution Matching Distillation to deliver state-of-the-art streaming video generation at 23.1 FPS without copying initial frames.

  11. VideoPoet: A Large Language Model for Zero-Shot Video Generation

    cs.CV 2023-12 unverdicted novelty 6.0

    VideoPoet is a large language model that performs zero-shot video generation with audio from diverse multimodal conditioning signals.

  12. Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

    cs.CV 2023-11 conditional novelty 6.0

    Stable Video Diffusion scales latent video diffusion models via text-to-image pretraining, video pretraining on curated data, and high-quality finetuning to produce competitive text-to-video and image-to-video results...

  13. Diffusion-APO: Trajectory-Aware Direct Preference Alignment for Video Diffusion Transformers

    cs.CV 2026-05 unverdicted novelty 5.0

    Diffusion-APO synchronizes training noise with inference trajectories in video diffusion models to improve preference alignment and visual quality.

  14. ST-Gen4D: Embedding 4D Spatiotemporal Cognition into World Model for 4D Generation

    cs.CV 2026-05 unverdicted novelty 5.0

    ST-Gen4D uses a world model that fuses global appearance and local dynamic graphs into a 4D cognition representation to guide consistent 4D Gaussian generation.

  15. Ride the Wave: Precision-Allocated Sparse Attention for Smooth Video Generation

    cs.CV 2026-04 unverdicted novelty 5.0

    PASA uses curvature-aware dynamic budgeting, grouped approximations, and stochastic attention routing to accelerate video diffusion transformers while eliminating temporal flickering from sparse patterns.

  16. World Model on Million-Length Video And Language With Blockwise RingAttention

    cs.LG 2024-02 unverdicted novelty 5.0

    Presents open-source 7B models for million-token video and language understanding via Blockwise RingAttention, setting new benchmarks in retrieval and long video tasks.

  17. Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models

    cs.CV 2024-02 unverdicted novelty 2.0

    The paper reviews the background, technology, applications, limitations, and future directions of OpenAI's Sora text-to-video generative model based on public information.

Reference graph

Works this paper leans on

61 extracted references · 61 canonical work pages · cited by 17 Pith papers · 12 internal anchors

  1. [1]

    Vivit: A video vision transformer

    Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lucic, and Cordelia Schmid. Vivit: A video vision transformer. In ICCV, 2021

  2. [2]

    Stochastic variational video prediction

    Mohammad Babaeizadeh, Chelsea Finn, Dumitru Erhan, Roy H Campbell, and Sergey Levine. Stochastic variational video prediction. ICLR, 2018

  3. [3]

    Fitvid: Overfitting in pixel-level video prediction

    Mohammad Babaeizadeh, Mohammad Taghi Saffar, Suraj Nair, Sergey Levine, Chelsea Finn, and Dumitru Erhan. Fitvid: Overfitting in pixel-level video prediction. arXiv preprint arXiv:2106.13195, 2021

  4. [4]

    Frozen in time: A joint video and image encoder for end-to-end retrieval

    Max Bain, Arsha Nagrani, Gül Varol, and Andrew Zisserman. Frozen in time: A joint video and image encoder for end-to-end retrieval. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1728–1738, 2021

  5. [5]

    Conditional gan with discriminative filter generation for text-to-video synthesis

    Yogesh Balaji, Martin Renqiang Min, Bing Bai, Rama Chellappa, and Hans Peter Graf. Conditional gan with discriminative filter generation for text-to-video synthesis. In IJCAI, 2019

  6. [6]

    Quo vadis, action recognition? a new model and the kinetics dataset

    Joao Carreira and Andrew Zisserman. Quo vadis, action recognition? a new model and the kinetics dataset. In CVPR, 2017

  7. [7]

    A short note about kinetics-600, 2018

    Joao Carreira, Eric Noland, Andras Banki-Horvath, Chloe Hillier, and Andrew Zisserman. A short note about kinetics-600, 2018

  8. [8]

    Maskgit: Masked generative image transformer

    Huiwen Chang, Han Zhang, Lu Jiang, Ce Liu, and William T. Freeman. Maskgit: Masked generative image transformer. arXiv preprint arXiv:2202.04200, 2022

  9. [9]

    Adversarial video generation on complex datasets

    Aidan Clark, Jeff Donahue, and Karen Simonyan. Adversarial video generation on complex datasets. arXiv preprint arXiv:1907.06571, 2019

  10. [10]

    Stochastic video generation with a learned prior

    Emily Denton and Rob Fergus. Stochastic video generation with a learned prior. In Jennifer Dy and Andreas Krause, editors,Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 1174–1183, 2018

  11. [11]

    Self-supervised visual planning with temporal skip connections

    Frederik Ebert, Chelsea Finn, Alex X. Lee, and Sergey Levine. Self-supervised visual planning with temporal skip connections, 2017

  12. [12]

    Taming transformers for high-resolution image synthesis, 2020

    Patrick Esser, Robin Rombach, and Björn Ommer. Taming transformers for high-resolution image synthesis, 2020

  13. [13]

    Unsupervised learning for physical interaction through video prediction

    Chelsea Finn, Ian Goodfellow, and Sergey Levine. Unsupervised learning for physical interaction through video prediction. In Advances in neural information processing systems, pages 64–72, 2016

  14. [14]

    Flexible diffusion modeling of long videos

    William Harvey, Saeid Naderiparizi, Vaden Masrani, Christian Weilbach, and Frank Wood. Flexible diffusion modeling of long videos. arXiv preprint arXiv:2205.11495, 2022

  15. [15]

    Gans trained by a two time-scale update rule converge to a local nash equilibrium

    Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems , 30, 2017

  16. [16]

    Classifier-free diffusion guidance, 2021

    Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance, 2021

  17. [17]

    Video Diffusion Models

    Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J Fleet. Video diffusion models. arXiv preprint arXiv:2204.03458, 2022

  18. [18]

    CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers

    Wenyi Hong, Ming Ding, Wendi Zheng, Xinghan Liu, and Jie Tang. Cogvideo: Large-scale pretraining for text-to-video generation via transformers. arXiv preprint arXiv:2205.15868 , 2022

  19. [19]

    Diffusion models for video prediction and infilling

    Tobias Höppe, Arash Mehrjou, Stefan Bauer, Didrik Nielsen, and Andrea Dittadi. Diffusion models for video prediction and infilling. arXiv preprint arXiv:2206.07696, 2022

  20. [20]

    Perceptual Losses for Real-Time Style Transfer and Super-Resolution

    Justin Johnson, Alexandre Alahi, and Li Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. arXiv preprint arXiv:1603.08155, 2016

  21. [21]

    Analyzing and improving the image quality of stylegan

    Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Analyzing and improving the image quality of stylegan. In CVPR, 2020

  22. [22]

    The kinetics human action video dataset, 2017

    Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, Mustafa Suleyman, and Andrew Zisserman. The kinetics human action video dataset, 2017

  23. [23]

    Adam: A method for stochastic optimization

    Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015

  24. [24]

    Videoflow: A flow-based generative model for video

    Manoj Kumar, Mohammad Babaeizadeh, Dumitru Erhan, Chelsea Finn, Sergey Levine, Laurent Dinh, and Durk Kingma. Videoflow: A flow-based generative model for video. arXiv preprint arXiv:1903.01434, 2019

  25. [25]

    Video generation from text

    Yitong Li, Martin Min, Dinghan Shen, David Carlson, and Lawrence Carin. Video generation from text. In AAAI, 2018

  26. [26]

    Decoupled weight decay regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In ICLR, 2019

  27. [27]

    Transformation-based adversarial video prediction on large-scale data

    Pauline Luc, Aidan Clark, Sander Dieleman, Diego de Las Casas, Yotam Doron, Albin Cassirer, and Karen Simonyan. Transformation-based adversarial video prediction on large-scale data. arXiv preprint arXiv:2003.04035, 2020

  28. [28]

    CCVS: Context-aware controllable video synthesis

    Guillaume Le Moing, Jean Ponce, and Cordelia Schmid. CCVS: Context-aware controllable video synthesis. In NeurIPS, 2021

  29. [29]

    Moments in time dataset: one million videos for event understanding

    Mathew Monfort, Alex Andonian, Bolei Zhou, Kandan Ramakrishnan, Sarah Adel Bargal, Tom Yan, Lisa Brown, Quanfu Fan, Dan Gutfreund, Carl Vondrick, et al. Moments in time dataset: one million videos for event understanding. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019

  30. [30]

    Learning audio-video modalities from image captions

    Arsha Nagrani, Paul Hongsuck Seo, Bryan Andrew Seybold, Anja Hauth, Santiago Manen, Chen Sun, and Cordelia Schmid. Learning audio-video modalities from image captions. In ECCV, 2022

  31. [31]

    Transframer: Arbitrary frame prediction with generative models

    Charlie Nash, João Carreira, Jacob Walker, Iain Barr, Andrew Jaegle, Mateusz Malinowski, and Peter Battaglia. Transframer: Arbitrary frame prediction with generative models. arXiv preprint arXiv:2203.09494, 2022

  32. [32]

    GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models

    Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741, 2021

  33. [33]

    Latent video transformer

    Ruslan Rakhimov, Denis Volkhonskiy, Alexey Artemov, Denis Zorin, and Evgeny Burnaev. Latent video transformer. arXiv preprint arXiv:2006.10704, 2020

  34. [34]

    Zero-shot text-to-image generation

    Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. In International Conference on Machine Learning, pages 8821–8831. PMLR, 2021

  35. [35]

    Hierarchical Text-Conditional Image Generation with CLIP Latents

    Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 2022

  36. [36]

    Video (language) modeling: a baseline for generative models of natural videos

    Marc'Aurelio Ranzato, Arthur Szlam, Joan Bruna, Michael Mathieu, Ronan Collobert, and Sumit Chopra. Video (language) modeling: a baseline for generative models of natural videos. arXiv preprint arXiv:1412.6604, 2014

  37. [37]

    Scaling up models and data with t5x and seqio

    Adam Roberts, Hyung Won Chung, Anselm Levskaya, Gaurav Mishra, James Bradbury, Daniel Andor, Sharan Narang, Brian Lester, Colin Gaffney, Afroz Mohiuddin, et al. Scaling up models and data with t5x and seqio. arXiv preprint arXiv:2203.17189, 2022

  38. [38]

    Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding

    Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S Sara Mahdavi, Rapha Gontijo Lopes, et al. Photorealistic text-to-image diffusion models with deep language understanding. arXiv preprint arXiv:2205.11487, 2022

  39. [39]

    Temporal generative adversarial nets with singular value clipping

    Masaki Saito, Eiichi Matsumoto, and Shunta Saito. Temporal generative adversarial nets with singular value clipping. In Proceedings of the IEEE international conference on computer vision, pages 2830–2839, 2017

  40. [40]

    Train sparsely, generate densely: Memory-efficient unsupervised training of high-resolution temporal gan

    Masaki Saito, Shunta Saito, Masanori Koyama, and Sosuke Kobayashi. Train sparsely, generate densely: Memory-efficient unsupervised training of high-resolution temporal gan. International Journal of Computer Vision, 128(10):2586–2606, 2020

  41. [41]

    LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs

    Christoph Schuhmann, Richard Vencu, Romain Beaumont, Robert Kaczmarczyk, Clayton Mullis, Aarush Katta, Theo Coombes, Jenia Jitsev, and Aran Komatsuzaki. Laion-400m: Open dataset of clip-filtered 400 million image-text pairs. arXiv preprint arXiv:2111.02114, 2021

  42. [42]

    Unsupervised learning of video representations using lstms

    Nitish Srivastava, Elman Mansimov, and Ruslan Salakhudinov. Unsupervised learning of video representations using lstms. In International Conference on Machine Learning , 2015

  43. [43]

    Mocogan: Decomposing motion and content for video generation

    Sergey Tulyakov, Ming-Yu Liu, Xiaodong Yang, and Jan Kautz. Mocogan: Decomposing motion and content for video generation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1526–1535, 2018

  44. [44]

    Towards Accurate Generative Models of Video: A New Metric & Challenges

    Thomas Unterthiner, Sjoerd van Steenkiste, Karol Kurach, Raphael Marinier, Marcin Michalski, and Sylvain Gelly. Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:1812.01717, 2018

  45. [45]

    Neural discrete representation learning

    Aaron van den Oord, Oriol Vinyals, and Koray Kavukcuoglu. Neural discrete representation learning. In NeurIPS, 2018

  46. [46]

    High fidelity video prediction with large stochastic recurrent neural networks

    Ruben Villegas, Arkanath Pathak, Harini Kannan, Dumitru Erhan, Quoc V Le, and Honglak Lee. High fidelity video prediction with large stochastic recurrent neural networks. In Advances in Neural Information Processing Systems, pages 81–91, 2019

  47. [47]

    Mcvd: Masked conditional video diffusion for prediction, generation, and interpolation

    Vikram Voleti, Alexia Jolicoeur-Martineau, and Christopher Pal. Mcvd: Masked conditional video diffusion for prediction, generation, and interpolation. arXiv preprint arXiv:2205.09853, 2022

  48. [48]

    Generating Videos with Scene Dynamics

    Carl Vondrick, Hamed Pirsiavash, and Antonio Torralba. Generating videos with scene dynamics. arXiv preprint arXiv:1609.02612, 2016

  49. [49]

    Predicting video with vqvae

    Jacob Walker, Ali Razavi, and Aäron van den Oord. Predicting video with vqvae. arXiv preprint arXiv:2103.01950, 2021

  50. [50]

    Predrnn: Recurrent neural networks for predictive learning using spatiotemporal lstms

    Yunbo Wang, Mingsheng Long, Jianmin Wang, Zhifeng Gao, and Philip S Yu. Predrnn: Recurrent neural networks for predictive learning using spatiotemporal lstms. Advances in neural information processing systems, 30, 2017

  51. [51]

    Scaling autoregressive video models

    Dirk Weissenborn, Oscar Täckström, and Jakob Uszkoreit. Scaling autoregressive video models. In ICLR, 2020

  52. [52]

    Godiva: Generating open-domain videos from natural descriptions

    Chenfei Wu, Lun Huang, Qianxi Zhang, Binyang Li, Lei Ji, Fan Yang, Guillermo Sapiro, and Nan Duan. Godiva: Generating open-domain videos from natural descriptions. arXiv preprint arXiv:2104.14806, 2021

  53. [53]

    Nuwa-infinity: Autoregressive over autoregressive generation for infinite visual synthesis

    Chenfei Wu, Jian Liang, Xiaowei Hu, Zhe Gan, Jianfeng Wang, Lijuan Wang, Zicheng Liu, Yuejian Fang, and Nan Duan. Nuwa-infinity: Autoregressive over autoregressive generation for infinite visual synthesis. arXiv preprint arXiv:2207.09814, 2022

  54. [54]

    NÜWA: Visual synthesis pre-training for neural visual world creation

    Chenfei Wu, Jian Liang, Lei Ji, Fan Yang, Yuejian Fang, Daxin Jiang, and Nan Duan. NÜWA: Visual synthesis pre-training for neural visual world creation. In ECCV, 2022

  55. [55]

    VideoGPT: Video Generation using VQ-VAE and Transformers

    Wilson Yan, Yunzhi Zhang, Pieter Abbeel, and Aravind Srinivas. Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:2104.10157, 2021

  56. [56]

    Diffusion probabilistic modeling for video generation

    Ruihan Yang, Prakhar Srivastava, and Stephan Mandt. Diffusion probabilistic modeling for video generation. arXiv preprint arXiv:2203.09481, 2022

  57. [57]

    Harp: Autoregressive latent video prediction with high-fidelity image generator

    Younggyo Seo, Kimin Lee, Fangchen Liu, Stephen James, and Pieter Abbeel. Harp: Autoregressive latent video prediction with high-fidelity image generator. arXiv preprint arXiv:2209.07143, 2022

  58. [58]

    Vector-quantized image modeling with improved vqgan

    Jiahui Yu, Xin Li, Jing Yu Koh, Han Zhang, Ruoming Pang, James Qin, Alexander Ku, Yuanzhong Xu, Jason Baldridge, and Yonghui Wu. Vector-quantized image modeling with improved vqgan. In ICLR, 2022

  59. [59]

    Scaling Autoregressive Models for Content-Rich Text-to-Image Generation

    Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gunjan Baid, Zirui Wang, Vijay Vasudevan, Alexander Ku, Yinfei Yang, Burcu Karagol Ayan, Ben Hutchinson, Wei Han, Zarana Parekh, Xin Li, Han Zhang, Jason Baldridge, and Yonghui Wu. Scaling autoregressive models for content-rich text-to-image generation. arXiv preprint arXiv:2206.10789, 2022

  60. [60]

    Scaling vision transformers

    Xiaohua Zhai, Alexander Kolesnikov, Neil Houlsby, and Lucas Beyer. Scaling vision transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12104–12113, 2022

  61. [61]

    The unreasonable effectiveness of deep features as a perceptual metric

    Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. CVPR, 2018