Phenaki: Variable Length Video Generation From Open Domain Textual Description
Recognition: 2 Lean theorem links
Pith reviewed 2026-05-17 01:37 UTC · model grok-4.3
The pith
Phenaki generates arbitrarily long videos from sequences of text prompts describing evolving scenes.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Phenaki demonstrates that realistic open-domain video synthesis from time-variable text prompts is possible: videos are first tokenized into a compact discrete representation by an encoder with causal temporal attention, a masked bidirectional transformer then predicts those tokens from text embeddings, and joint training on image and video data lets the system generalize to arbitrary video lengths.
What carries the argument
The causal-attention video tokenizer that encodes variable-length videos as a small number of discrete tokens, combined with the text-conditioned bidirectional masked transformer that generates the token sequence.
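To make that machinery concrete, here is a minimal sketch of attention applied causally along the time axis, in the spirit of what the paper's tokenizer is described as doing. It is an editorial reconstruction, not the authors' code: the tensor shapes, the `temporal_causal_mask` helper, and the use of PyTorch are all assumptions.

```python
import torch

def temporal_causal_mask(num_frames: int) -> torch.Tensor:
    """Boolean mask: entry (i, j) is True when frame i may attend to frame j (j <= i)."""
    return torch.tril(torch.ones(num_frames, num_frames, dtype=torch.bool))

def causal_temporal_attention(x: torch.Tensor) -> torch.Tensor:
    """Self-attention across the time axis only, under a causal mask.

    x: (batch, frames, tokens_per_frame, dim). Each spatial token position mixes
    information across the current and earlier frames, never future ones.
    """
    b, t, n, d = x.shape
    q = k = v = x.permute(0, 2, 1, 3).reshape(b * n, t, d)  # fold spatial tokens into batch
    mask = temporal_causal_mask(t).to(x.device)             # (t, t), True = allowed
    attn = (q @ k.transpose(-2, -1)) / d ** 0.5
    attn = attn.masked_fill(~mask, float("-inf"))
    out = attn.softmax(dim=-1) @ v
    return out.reshape(b, n, t, d).permute(0, 2, 1, 3)

# Because each frame attends only to itself and earlier frames, appending frames
# to a video leaves previously computed representations unchanged, which is what
# makes variable-length encoding (and decoding) possible.
video = torch.randn(2, 11, 256, 64)   # e.g. 11 frames, a 16x16 token grid, 64-dim embeddings
encoded = causal_temporal_attention(video)
```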
If this is right
- Generation of videos that follow a narrative by accepting a changing sequence of text prompts over time.
- Production of videos longer than any example in the video training corpus.
- Enhanced consistency across space and time in the output compared to independent frame generation.
- More efficient representation since fewer tokens are needed per video.
Where Pith is reading between the lines
- Such systems could eventually support creating full-length movies or educational series from detailed textual outlines.
- Extending this to interactive settings where user prompts update mid-generation seems feasible.
- The token compression idea may transfer to generating variable-duration content in other domains like music or text stories.
Load-bearing premise
Joint training on a large set of image-text pairs and a smaller set of video-text pairs produces a model that generalizes to generate videos of lengths and qualities beyond the video examples.
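A minimal sketch of how that premise could be realized in a data pipeline, assuming images are folded in as single-frame videos (plausible here because the causal tokenizer encodes the first frame independently of later ones). The mixing ratio, field names, and helper are assumptions, not values reported by the paper.

```python
import random

def sample_joint_batch(image_text_pairs, video_text_pairs, batch_size, p_image=0.8):
    """Draw a mixed batch for joint training, treating an image as a one-frame video."""
    batch = []
    for _ in range(batch_size):
        if random.random() < p_image:                     # mostly image-text pairs...
            image, caption = random.choice(image_text_pairs)
            batch.append({"frames": [image], "text": caption})
        else:                                             # ...plus some video-text pairs
            frames, caption = random.choice(video_text_pairs)
            batch.append({"frames": frames, "text": caption})
    return batch
```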
What would settle it
Generating a video from a long sequence of prompts for a story spanning more frames than any training video, then verifying whether the output maintains visual coherence and matches the prompt changes throughout.
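One way such a test could be rolled out, sketched under assumptions: `model.generate_segment` is a hypothetical interface that produces new frames for one prompt given the most recent frames as context, and the segment length and overlap are illustrative rather than the paper's settings (the actual model conditions on video tokens, not raw frames).

```python
def generate_story_video(model, prompts, frames_per_segment=11, overlap=5):
    """Roll out a video longer than any training clip from a sequence of prompts.

    Each segment is conditioned on the last few frames already generated, so the
    content can change prompt by prompt while staying visually continuous.
    """
    video = []
    for prompt in prompts:
        context = video[-overlap:] if video else []
        new_frames = model.generate_segment(prompt, context=context,
                                            num_frames=frames_per_segment)
        video.extend(new_frames)
    return video
```

Coherence could then be scored on the full rollout, for example with per-segment text-video similarity and frame-to-frame consistency measures, rather than only on clips no longer than the training videos.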
Original abstract
We present Phenaki, a model capable of realistic video synthesis, given a sequence of textual prompts. Generating videos from text is particularly challenging due to the computational cost, limited quantities of high quality text-video data and variable length of videos. To address these issues, we introduce a new model for learning video representation which compresses the video to a small representation of discrete tokens. This tokenizer uses causal attention in time, which allows it to work with variable-length videos. To generate video tokens from text we are using a bidirectional masked transformer conditioned on pre-computed text tokens. The generated video tokens are subsequently de-tokenized to create the actual video. To address data issues, we demonstrate how joint training on a large corpus of image-text pairs as well as a smaller number of video-text examples can result in generalization beyond what is available in the video datasets. Compared to the previous video generation methods, Phenaki can generate arbitrary long videos conditioned on a sequence of prompts (i.e. time variable text or a story) in open domain. To the best of our knowledge, this is the first time a paper studies generating videos from time variable prompts. In addition, compared to the per-frame baselines, the proposed video encoder-decoder computes fewer tokens per video but results in better spatio-temporal consistency.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. Phenaki introduces a text-to-video model that generates variable-length videos from sequences of textual prompts (time-varying text or stories) in open domain. It proposes a causal video tokenizer that compresses videos to discrete tokens using causal temporal attention, paired with a bidirectional masked transformer that generates tokens conditioned on text embeddings. Joint training on large image-text corpora plus limited video-text data is claimed to enable generalization beyond video datasets alone, yielding arbitrary-length outputs with improved spatio-temporal consistency over per-frame baselines; the work positions itself as the first to study generation from time-variable prompts.
Significance. If the empirical claims are substantiated, the work would be significant for text-conditioned video synthesis by demonstrating coherent multi-prompt, story-like generation over arbitrary durations and addressing data scarcity via image-video joint training. The causal tokenizer for variable-length handling and the bidirectional masked transformer for text conditioning represent useful architectural contributions that could influence subsequent variable-length video models.
major comments (2)
- [Experiments / Results] The manuscript asserts that joint training on image-text pairs plus video-text examples produces generalization for arbitrary-length, time-variable prompt sequences beyond what video datasets support, yet provides no ablation studies that isolate the image-text component's contribution to multi-prompt coherence, long-horizon consistency, or open-domain adherence. This is load-bearing for the central generalization claim and the 'first study of time-variable prompts' positioning.
- [Abstract and Results] No quantitative metrics (e.g., FVD, FID-video, or controlled user studies), error bars, or statistical comparisons to prior video generation methods or per-frame baselines are reported. The claimed improvements in spatio-temporal consistency and open-domain performance therefore rest solely on qualitative examples, limiting assessment of the central claims.
minor comments (2)
- [Method / Tokenizer] The description of the causal attention mask in the video tokenizer would benefit from an explicit equation or diagram showing how it enforces causality across variable-length sequences (one plausible form is sketched after this list).
- [Figures] Figure captions and axis labels in the qualitative results could be expanded to specify prompt sequences, video lengths, and comparison conditions for easier reader interpretation.
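As an editorial illustration of what the first minor comment asks for, one plausible way to write the temporal causal mask; the notation is assumed, not the paper's.

```latex
% Frame i may attend to frame j only when j <= i; T is the (variable) number of frames.
\[
  M_{ij} =
  \begin{cases}
    0       & \text{if } j \le i,\\
    -\infty & \text{if } j > i,
  \end{cases}
  \qquad
  A = \operatorname{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}} + M\right) V,
  \qquad 1 \le i, j \le T .
\]
```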
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive report. We address each major comment below and have revised the manuscript to incorporate the suggested improvements where feasible.
Point-by-point responses
- Referee: [Experiments / Results] The manuscript asserts that joint training on image-text pairs plus video-text examples produces generalization for arbitrary-length, time-variable prompt sequences beyond what video datasets support, yet provides no ablation studies that isolate the image-text component's contribution to multi-prompt coherence, long-horizon consistency, or open-domain adherence. This is load-bearing for the central generalization claim and the 'first study of time-variable prompts' positioning.
Authors: We agree that dedicated ablations isolating the image-text data contribution would strengthen the central claim. In the revised manuscript we have added a new ablation subsection that trains an otherwise identical model on video-text data only and compares it directly to the joint image-video model on multi-prompt coherence, long-horizon consistency, and open-domain adherence. The results are reported both qualitatively and with newly introduced proxy quantitative measures; they support the value of the image-text component. We have also tempered the 'first study' claim by adding citations to concurrent related work and clarifying the specific contribution of variable-length story generation.
Revision: yes
- Referee: [Abstract and Results] No quantitative metrics (e.g., FVD, FID-video, or controlled user studies), error bars, or statistical comparisons to prior video generation methods or per-frame baselines are reported. The claimed improvements in spatio-temporal consistency and open-domain performance therefore rest solely on qualitative examples, limiting assessment of the central claims.
Authors: We acknowledge that the original submission relied primarily on qualitative examples. The revised manuscript now includes quantitative evaluations: Fréchet Video Distance (FVD) and video FID computed on held-out test sets, direct comparisons against prior text-to-video methods and per-frame baselines, error bars obtained from multiple random seeds, and a controlled user study with 50 raters assessing spatio-temporal consistency and realism. These additions appear in an expanded Results section and are summarized in the abstract.
Revision: yes
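For readers unfamiliar with FVD, a minimal sketch of the distance computation, assuming real and generated video features have already been extracted with a pretrained video classifier (typically I3D); this is an editorial illustration, not the paper's evaluation code.

```python
import numpy as np
from scipy import linalg

def frechet_video_distance(feats_real: np.ndarray, feats_fake: np.ndarray) -> float:
    """Fréchet distance between Gaussians fitted to real and generated video features.

    feats_*: arrays of shape (num_videos, feature_dim); feature extraction itself
    is outside this sketch.
    """
    mu_r, mu_f = feats_real.mean(axis=0), feats_fake.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_f = np.cov(feats_fake, rowvar=False)
    covmean, _ = linalg.sqrtm(cov_r @ cov_f, disp=False)  # matrix square root of the product
    if np.iscomplexobj(covmean):
        covmean = covmean.real                            # drop tiny imaginary parts
    diff = mu_r - mu_f
    return float(diff @ diff + np.trace(cov_r + cov_f - 2.0 * covmean))
```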
Circularity Check
No circularity: claims rest on new architecture and empirical training strategy, not reductions to self-defined quantities or fitted inputs.
full rationale
The paper introduces a causal-attention tokenizer for variable-length video tokenization and a bidirectional masked transformer for text-conditioned generation. The joint image-text plus video-text training is asserted as an empirical solution to data limits, with no equations or derivations shown that reduce the generalization claim to parameters fitted from the same data or to prior self-citations. The 'first study of time-variable prompts' is a novelty statement, not a mathematical result derived from the model's own definitions. No load-bearing self-citation chains, ansatzes smuggled via citation, or renaming of known results appear in the provided text. The derivation is self-contained via architectural choices and training procedure.
Axiom & Free-Parameter Ledger
free parameters (2)
- discrete token vocabulary size
- transformer layer count and attention heads
axioms (2)
- domain assumption: Causal attention over time allows the tokenizer to process videos of arbitrary length without retraining or padding
- domain assumption: Joint training on image-text pairs plus limited video-text data produces open-domain generalization superior to video-only training
invented entities (1)
- Causal video tokenizer (no independent evidence)
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/ArithmeticFromLogic.lean: LogicNat orbit and embed_strictMono_of_one_lt [unclear]
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Linked passage: "C-ViViT encoder uses causal attention in time... allows encoding and decoding of variable length videos"
- IndisputableMonolith/Foundation/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel [unclear]
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Linked passage: "joint training on a large corpus of image-text pairs as well as a smaller number of video-text examples"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 17 Pith papers
- MusicLM: Generating Music From Text. MusicLM produces coherent multi-minute 24 kHz music from text prompts using hierarchical sequence-to-sequence modeling and outperforms prior systems in quality and text adherence.
- TeDiO: Temporal Diagonal Optimization for Training-Free Coherent Video Diffusion. TeDiO regularizes temporal diagonals in diffusion transformer attention maps to produce smoother video motion while keeping per-frame quality intact.
- DCR: Counterfactual Attractor Guidance for Rare Compositional Generation. DCR uses a counterfactual attractor and projection-based repulsion to suppress default completion bias in diffusion models, improving fidelity for rare compositional prompts while preserving quality.
- ViVa: A Video-Generative Value Model for Robot Reinforcement Learning. ViVa turns a video generator into a value model for robot RL that jointly forecasts future states and task value, yielding better performance on real-world box assembly when integrated with RECAP.
- Learning Interactive Real-World Simulators. UniSim learns a universal real-world simulator from orchestrated diverse datasets, enabling zero-shot deployment of policies trained purely in simulation.
- Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation. A new shared video-image tokenizer enables large language models to surpass diffusion models on standard visual generation benchmarks.
- Head Forcing: Long Autoregressive Video Generation via Head Heterogeneity. Head Forcing assigns tailored KV cache strategies to local, anchor, and memory attention heads plus head-wise RoPE re-encoding to extend autoregressive video generation from seconds to minutes without training.
- SWIFT: Prompt-Adaptive Memory for Efficient Interactive Long Video Generation. SWIFT introduces a semantic injection cache with head-wise updates and an adaptive dynamic window plus segment anchors to achieve efficient multi-prompt long video generation at 22.6 FPS while preserving quality in ca...
- Rolling Sink: Bridging Limited-Horizon Training and Open-Ended Testing in Autoregressive Video Diffusion. Rolling Sink is a training-free cache adjustment technique that maintains visual consistency in autoregressive video diffusion models for ultra-long open-ended generation beyond training horizons.
- Reward Forcing: Efficient Streaming Video Generation with Rewarded Distribution Matching Distillation. Reward Forcing combines EMA-Sink tokens and Rewarded Distribution Matching Distillation to deliver state-of-the-art streaming video generation at 23.1 FPS without copying initial frames.
- VideoPoet: A Large Language Model for Zero-Shot Video Generation. VideoPoet is a large language model that performs zero-shot video generation with audio from diverse multimodal conditioning signals.
- Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets. Stable Video Diffusion scales latent video diffusion models via text-to-image pretraining, video pretraining on curated data, and high-quality finetuning to produce competitive text-to-video and image-to-video results...
- Diffusion-APO: Trajectory-Aware Direct Preference Alignment for Video Diffusion Transformers. Diffusion-APO synchronizes training noise with inference trajectories in video diffusion models to improve preference alignment and visual quality.
- ST-Gen4D: Embedding 4D Spatiotemporal Cognition into World Model for 4D Generation. ST-Gen4D uses a world model that fuses global appearance and local dynamic graphs into a 4D cognition representation to guide consistent 4D Gaussian generation.
- Ride the Wave: Precision-Allocated Sparse Attention for Smooth Video Generation. PASA uses curvature-aware dynamic budgeting, grouped approximations, and stochastic attention routing to accelerate video diffusion transformers while eliminating temporal flickering from sparse patterns.
- World Model on Million-Length Video And Language With Blockwise RingAttention. Presents open-source 7B models for million-token video and language understanding via Blockwise RingAttention, setting new benchmarks in retrieval and long video tasks.
- Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models. The paper reviews the background, technology, applications, limitations, and future directions of OpenAI's Sora text-to-video generative model based on public information.