arxiv: 2205.15868 · v1 · submitted 2022-05-29 · 💻 cs.CV · cs.CL· cs.LG

Recognition: 2 theorem links

· Lean Theorem

CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers

Wenyi Hong , Ming Ding , Wendi Zheng , Xinghan Liu , Jie Tang

Authors on Pith no claims yet

Pith reviewed 2026-05-11 12:17 UTC · model grok-4.3

classification 💻 cs.CV cs.CLcs.LG

keywords text-to-video generationtransformer modelpretrainingvideo synthesishierarchical trainingimage-to-video transferCogVideo

0 comments

The pith

CogVideo generates videos from text by inheriting weights from a text-to-image model and applying multi-frame-rate hierarchical training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents CogVideo, a 9-billion-parameter transformer for text-to-video generation. It tackles the prohibitive cost of training video models from scratch and the shortage of well-aligned text-video data by starting from the CogView2 image model and adding staged training that aligns descriptions with clips at varying speeds. A reader would care because the resulting open-source system produces more coherent motion and semantics than other public models, as measured by both automated metrics and human judgments. This shows a workable path to scale video synthesis without building every capability from zero data.

Core claim

Large-scale pretrained transformers have created milestones in text and text-to-image generation, yet video generation faces huge computation costs and scarce relevant datasets. We present the 9B-parameter transformer CogVideo, trained by inheriting a pretrained text-to-image model CogView2. We also propose a multi-frame-rate hierarchical training strategy to better align text and video clips. As the first open-source large-scale pretrained text-to-video model, CogVideo outperforms all publicly available models at a large margin in machine and human evaluations.

What carries the argument

Weight inheritance from the CogView2 text-to-image model plus multi-frame-rate hierarchical training, which transfers static image understanding to dynamic video while aligning text semantics across frame rates.

If this is right

Generated videos exhibit stronger alignment between text descriptions and complex movements.
An open-source model at this scale becomes available for further research and applications.
Video generation can be scaled without full from-scratch training on massive video corpora.
The approach demonstrates transfer of capabilities from image to video domains via staged alignment training.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same inheritance-plus-hierarchy pattern might support longer or higher-resolution videos if compute budgets increase.
Fine-tuning on domain-specific video sets could adapt the model for tasks such as animation or simulation.
Combining the output with audio or 3D models could extend the system toward richer multimedia generation.

Load-bearing premise

That inheriting weights from a text-to-image model plus multi-frame-rate hierarchical training is enough to overcome scarce text-video data and the high cost of training video models from scratch.

What would settle it

Blind human preference tests or standard video quality metrics such as FVD in which CogVideo does not show a clear margin over other publicly released text-to-video systems.

read the original abstract

Large-scale pretrained transformers have created milestones in text (GPT-3) and text-to-image (DALL-E and CogView) generation. Its application to video generation is still facing many challenges: The potential huge computation cost makes the training from scratch unaffordable; The scarcity and weak relevance of text-video datasets hinder the model understanding complex movement semantics. In this work, we present 9B-parameter transformer CogVideo, trained by inheriting a pretrained text-to-image model, CogView2. We also propose multi-frame-rate hierarchical training strategy to better align text and video clips. As (probably) the first open-source large-scale pretrained text-to-video model, CogVideo outperforms all publicly available models at a large margin in machine and human evaluations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

CogVideo adapts a text-to-image model for video with hierarchical training, but its performance claims lack the supporting numbers and controls needed to evaluate them.

read the letter

The paper's main contribution is CogVideo, a large open-source transformer for text-to-video that builds directly on a text-to-image model with a new training schedule. This approach is new in its scale and openness for video generation. It does well by tackling the real barriers head on: the cost of training video models and the limited supply of good paired text-video data. Inheriting weights from CogView2 transfers some visual understanding, and the multi-frame-rate strategy aims to improve text-video alignment across different speeds. The results section claims big wins over existing models in both automated metrics and human judgments. If those hold with proper controls, it would be a useful benchmark. However, the abstract provides no actual scores, no list of baselines, and no ablations for the inheritance or the hierarchical training. This is a problem because the performance edge could come from many places. The stress-test note is on point here. We need to see if removing the inheritance drops performance or if single-rate training is enough. Without that, it's hard to credit the specific choices. The math and architecture seem standard transformer stuff adapted to video, which is fine. The citation pattern probably covers the relevant image and video gen papers. This work is for people building or evaluating generative video systems. A reader who wants to know how to scale up from image to video will find practical lessons, even if they have to implement the details themselves. It deserves a serious referee. The problem matters, the model is substantial, and the ideas are worth testing in review. I recommend sending it to peer review with requests for the quantitative results and ablations.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces CogVideo, a 9B-parameter transformer for text-to-video generation. It inherits weights from the pretrained CogView2 text-to-image model to reduce compute costs and applies a multi-frame-rate hierarchical training strategy to improve text-video alignment despite limited relevant data. The authors claim CogVideo is likely the first open-source large-scale pretrained text-to-video model and outperforms all publicly available models by a large margin in both machine and human evaluations.

Significance. If the performance claims hold under rigorous controls, this would be a meaningful early contribution to text-to-video generation by showing how weight inheritance from image models and hierarchical training can scale to 9B parameters. The open release of the model is a clear strength that could enable follow-on work, analogous to the role of early large text and image models. However, the significance is reduced because the central empirical claim depends on unshown evidence that the proposed techniques, rather than model scale or dataset choices alone, drive the gains.

major comments (3)

[§4 (Experiments)] §4 (Experiments): No ablation is presented that isolates the effect of inheriting weights from CogView2 versus random initialization at 9B scale. This is load-bearing for the introduction's claim that inheritance overcomes text-video data scarcity; without it, observed gains could be explained by capacity or data alone.
[§4.1 (Evaluation protocol)] §4.1 (Evaluation protocol): The multi-frame-rate hierarchical training is not compared against a single-rate baseline in controlled experiments. This weakens the assertion that the hierarchical schedule is responsible for improved text-video alignment, as required to support the 'large margin' superiority claim.
[§4 (Experiments)] §4 (Experiments): The manuscript supplies no quantitative metrics (e.g., specific FID, CLIP-score, or human preference percentages), named baselines, or dataset statistics to substantiate the 'outperforms all publicly available models at a large margin' statement. These details are necessary to evaluate the central empirical claim.

minor comments (2)

[Abstract] The abstract would be improved by including at least one concrete quantitative result to support the performance claims.
Figure captions should be expanded to be self-contained, especially for any qualitative generation examples.

Simulated Author's Rebuttal

3 responses · 2 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below, providing honest clarifications on our design choices and empirical claims while noting where revisions are feasible.

read point-by-point responses

Referee: §4 (Experiments): No ablation is presented that isolates the effect of inheriting weights from CogView2 versus random initialization at 9B scale. This is load-bearing for the introduction's claim that inheritance overcomes text-video data scarcity; without it, observed gains could be explained by capacity or data alone.

Authors: We agree that a controlled ablation isolating weight inheritance at the full 9B scale would strengthen the claim regarding data scarcity. However, training a 9B-parameter model from random initialization requires prohibitive compute (estimated >10,000 GPU-hours per run), which exceeded our resources. Our approach follows established transfer-learning practices from image to video models, with performance gains shown via overall machine and human evaluations. In revision, we will expand Section 4 with additional discussion of this limitation and any supporting evidence from smaller-scale pretraining experiments. revision: partial
Referee: §4.1 (Evaluation protocol): The multi-frame-rate hierarchical training is not compared against a single-rate baseline in controlled experiments. This weakens the assertion that the hierarchical schedule is responsible for improved text-video alignment, as required to support the 'large margin' superiority claim.

Authors: We acknowledge that a direct single-rate baseline comparison would better isolate the hierarchical strategy's contribution. The multi-frame-rate approach was introduced to address varying motion speeds and improve alignment under data constraints, with benefits visible in qualitative results and overall metrics. Due to compute limits, this specific ablation was not performed. We will revise the manuscript to elaborate on the design rationale, add qualitative comparisons where possible, and list the missing ablation as a limitation and future direction. revision: partial
Referee: §4 (Experiments): The manuscript supplies no quantitative metrics (e.g., specific FID, CLIP-score, or human preference percentages), named baselines, or dataset statistics to substantiate the 'outperforms all publicly available models at a large margin' statement. These details are necessary to evaluate the central empirical claim.

Authors: We will revise the experiments section to report the specific quantitative metrics (FID, CLIP-score, human preference percentages), explicitly name all public baselines compared, and include dataset statistics. These details were available from our evaluations but omitted for brevity in the initial submission; adding them will allow direct assessment of the performance claims. revision: yes

standing simulated objections not resolved

Full 9B-scale ablation isolating weight inheritance from random initialization
Controlled ablation comparing multi-frame-rate hierarchical training to single-rate baseline

Circularity Check

0 steps flagged

No significant circularity; empirical performance claim is independently evaluated

full rationale

The paper's central claim is an empirical statement that CogVideo outperforms public baselines after inheriting weights from CogView2 and applying multi-frame-rate hierarchical training. This is supported by machine and human evaluations on external benchmarks rather than any derivation that reduces to its own inputs by construction. No equations, fitted parameters renamed as predictions, or load-bearing self-citations that forbid alternatives appear in the provided abstract or described methodology. The inheritance from CogView2 and the training strategy are presented as engineering choices whose effectiveness is measured externally, not assumed or defined into the result. This is a standard self-contained empirical ML paper with no circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Review limited to abstract; no explicit free parameters, axioms, or invented entities are described beyond standard transformer assumptions.

axioms (1)

domain assumption A pretrained text-to-image transformer can be effectively adapted to video by adding temporal training
Invoked when the paper states it inherits from CogView2 to address computation cost

pith-pipeline@v0.9.0 · 5434 in / 1076 out tokens · 58400 ms · 2026-05-11T12:17:32.445086+00:00 · methodology

discussion (0)

Forward citations

Cited by 42 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

MusicLM: Generating Music From Text
cs.SD 2023-01 conditional novelty 8.0

MusicLM produces coherent multi-minute 24 kHz music from text prompts using hierarchical sequence-to-sequence modeling and outperforms prior systems in quality and text adherence.
R-DMesh: Video-Guided 3D Animation via Rectified Dynamic Mesh Flow
cs.CV 2026-05 unverdicted novelty 7.0

R-DMesh generates high-fidelity 4D meshes aligned to video by disentangling base mesh, motion, and a learned rectification jump offset inside a VAE, then using Triflow Attention and rectified-flow diffusion.
GTA: Advancing Image-to-3D World Generation via Geometry Then Appearance Video Diffusion
cs.CV 2026-05 unverdicted novelty 7.0

GTA generates 3D worlds from single images via a two-stage video diffusion process that prioritizes geometry before appearance to improve structural consistency.
OmniNFT: Modality-wise Omni Diffusion Reinforcement for Joint Audio-Video Generation
cs.CV 2026-05 unverdicted novelty 7.0

OmniNFT introduces modality-wise advantage routing, layer-wise gradient surgery, and region-wise loss reweighting in an online diffusion RL framework to improve audio-video quality, alignment, and synchronization.
DCR: Counterfactual Attractor Guidance for Rare Compositional Generation
cs.CV 2026-05 unverdicted novelty 7.0

DCR uses a counterfactual attractor and projection-based repulsion to suppress default completion bias in diffusion models, improving fidelity for rare compositional prompts while preserving quality.
DeVI: Physics-based Dexterous Human-Object Interaction via Synthetic Video Imitation
cs.CV 2026-04 unverdicted novelty 7.0

DeVI enables zero-shot physically plausible dexterous control by imitating synthetic videos via a hybrid 3D-human plus 2D-object tracking reward.
ReImagine: Rethinking Controllable High-Quality Human Video Generation via Image-First Synthesis
cs.CV 2026-04 unverdicted novelty 7.0

ReImagine decouples human appearance from temporal consistency via pretrained image backbones, SMPL-X motion guidance, and training-free video diffusion refinement to generate high-quality controllable videos.
RoboWM-Bench: A Benchmark for Evaluating World Models in Robotic Manipulation
cs.RO 2026-04 unverdicted novelty 7.0

RoboWM-Bench evaluates video world models by converting their outputs into executable robot actions and running them on manipulation tasks, showing that physical inconsistencies remain common.
RoboWM-Bench: A Benchmark for Evaluating World Models in Robotic Manipulation
cs.RO 2026-04 unverdicted novelty 7.0

RoboWM-Bench evaluates video world models by converting their manipulation video predictions into executable actions validated in simulation, showing that visual plausibility does not guarantee physical executability.
UDM-GRPO: Stable and Efficient Group Relative Policy Optimization for Uniform Discrete Diffusion Models
cs.CV 2026-04 unverdicted novelty 7.0

UDM-GRPO is the first RL integration for uniform discrete diffusion models, using final clean samples as actions and forward-process trajectory reconstruction to raise GenEval accuracy from 69% to 96% and OCR accuracy...
MCSC-Bench: Multimodal Context-to-Script Creation for Realistic Video Production
cs.MM 2026-04 unverdicted novelty 7.0

MCSC-Bench is the first large-scale dataset for the Multimodal Context-to-Script Creation task, requiring models to select relevant shots from redundant materials, plan missing shots, and generate coherent scripts wit...
LottieGPT: Tokenizing Vector Animation for Autoregressive Generation
cs.CV 2026-04 unverdicted novelty 7.0

LottieGPT tokenizes Lottie animations into compact sequences and fine-tunes Qwen-VL to autoregressively generate coherent vector animations from natural language or visual prompts, outperforming prior SVG models.
MoRight: Motion Control Done Right
cs.CV 2026-04 unverdicted novelty 7.0

MoRight disentangles object and camera motion via canonical-view specification and temporal cross-view attention, while decomposing motion into active user-driven and passive consequence components to learn and apply ...
OmniCamera: A Unified Framework for Multi-task Video Generation with Arbitrary Camera Control
cs.CV 2026-04 unverdicted novelty 7.0

OmniCamera disentangles video content and camera motion for multi-task generation with arbitrary camera control via the OmniCAM hybrid dataset and Dual-level Curriculum Co-Training.
Detecting AI-Generated Videos with Spiking Neural Networks
cs.CV 2026-05 unverdicted novelty 6.0

MAST with spiking neural networks achieves 93.14% mean accuracy detecting AI-generated videos from 10 unseen generators by exploiting smoother pixel residuals and compact semantic trajectories.
Video Active Perception: Effective Inference-Time Long-Form Video Understanding with Vision-Language Models
cs.CV 2026-05 unverdicted novelty 6.0

VAP is a training-free active-perception method that improves zero-shot long-form video QA performance and frame efficiency up to 5.6x in VLMs by selecting keyframes that differ from priors generated by a text-conditi...
UniVidX: A Unified Multimodal Framework for Versatile Video Generation via Diffusion Priors
cs.CV 2026-05 unverdicted novelty 6.0

UniVidX unifies diverse video generation tasks into one conditional diffusion model using stochastic condition masking, decoupled gated LoRAs, and cross-modal self-attention.
PhyCo: Learning Controllable Physical Priors for Generative Motion
cs.CV 2026-04 unverdicted novelty 6.0

PhyCo adds continuous physical control to video diffusion models via physics-supervised fine-tuning on a large simulation dataset and VLM-guided rewards, yielding measurable gains in physical realism on the Physics-IQ...
Mutual Forcing: Dual-Mode Self-Evolution for Fast Autoregressive Audio-Video Character Generation
cs.CV 2026-04 unverdicted novelty 6.0

Mutual Forcing trains a single native autoregressive audio-video model with mutually reinforcing few-step and multi-step modes via self-distillation to match 50-step baselines at 4-8 steps.
Structured State-Space Regularization for Compact and Generation-Friendly Image Tokenization
cs.CV 2026-04 unverdicted novelty 6.0

A new regularizer transfers frequency awareness from state-space models into image tokenizers, yielding more compact latents that improve diffusion-model generation quality with little reconstruction penalty.
ELT: Elastic Looped Transformers for Visual Generation
cs.CV 2026-04 unverdicted novelty 6.0

Elastic Looped Transformers share weights across recurrent blocks and apply intra-loop self-distillation to deliver 4x parameter reduction while matching competitive FID and FVD scores on ImageNet and UCF-101.
InsEdit: Towards Instruction-based Visual Editing via Data-Efficient Video Diffusion Models Adaptation
cs.CV 2026-04 unverdicted novelty 6.0

InsEdit adapts a video diffusion backbone for text-instruction video editing via Mutual Context Attention, achieving SOTA open-source results with O(100K) data while also supporting image editing.
SceneScribe-1M: A Large-Scale Video Dataset with Comprehensive Geometric and Semantic Annotations
cs.CV 2026-04 unverdicted novelty 6.0

SceneScribe-1M is a new dataset of 1 million videos with semantic text, camera parameters, dense depth, and consistent 3D point tracks to support monocular depth estimation, scene reconstruction, point tracking, and t...
GENSERVE: Efficient Co-Serving of Heterogeneous Diffusion Model Workloads
cs.DC 2026-04 unverdicted novelty 6.0

GENSERVE improves SLO attainment by up to 44% for co-serving heterogeneous T2I and T2V diffusion workloads via step-level preemption, elastic parallelism, and joint scheduling.
VBench-2.0: Advancing Video Generation Benchmark Suite for Intrinsic Faithfulness
cs.CV 2025-03 accept novelty 6.0

VBench-2.0 is a benchmark suite that automatically evaluates video generative models on five dimensions of intrinsic faithfulness: Human Fidelity, Controllability, Creativity, Physics, and Commonsense using VLMs, LLMs...
Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations
cs.CV 2024-12 unverdicted novelty 6.0

Video Prediction Policy conditions robot action learning on future-frame predictions inside fine-tuned video diffusion models, yielding 18.6% relative gains on Calvin ABC-D and 31.6% higher real-world success rates.
CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer
cs.CV 2024-08 unverdicted novelty 6.0

CogVideoX generates coherent 10-second text-to-video outputs at high resolution using a 3D VAE, expert adaptive LayerNorm transformer, progressive training, and a custom data pipeline, claiming state-of-the-art results.
CameraCtrl: Enabling Camera Control for Text-to-Video Generation
cs.CV 2024-04 unverdicted novelty 6.0

CameraCtrl enables accurate camera pose control in video diffusion models through a trained plug-and-play module and dataset choices emphasizing diverse camera trajectories with matching appearance.
Latent Video Diffusion Models for High-Fidelity Long Video Generation
cs.CV 2022-11 unverdicted novelty 6.0

Latent-space hierarchical diffusion models with targeted error-correction techniques generate realistic videos exceeding 1000 frames while using less compute than prior pixel-space approaches.
Make-A-Video: Text-to-Video Generation without Text-Video Data
cs.CV 2022-09 unverdicted novelty 6.0

Make-A-Video achieves state-of-the-art text-to-video generation by decomposing temporal U-Net and attention structures to add space-time modeling to text-to-image models, trained without any paired text-video data.
R-DMesh: Video-Guided 3D Animation via Rectified Dynamic Mesh Flow
cs.CV 2026-05 unverdicted novelty 5.0

R-DMesh uses a VAE with a learned rectification jump offset and Triflow Attention inside a rectified-flow diffusion transformer to produce video-aligned 4D meshes despite initial pose misalignment.
ST-Gen4D: Embedding 4D Spatiotemporal Cognition into World Model for 4D Generation
cs.CV 2026-05 unverdicted novelty 5.0

ST-Gen4D uses a world model that fuses global appearance and local dynamic graphs into a 4D cognition representation to guide consistent 4D Gaussian generation.
Embody4D: A Generalist 4D World Model for Embodied AI
cs.CV 2026-05 unverdicted novelty 5.0

Embody4D generates high-fidelity, view-consistent novel views from monocular videos for embodied scenarios via 3D-aware data synthesis, adaptive noise injection, and interaction-aware attention.
StableIDM: Stabilizing Inverse Dynamics Model against Manipulator Truncation via Spatio-Temporal Refinement
cs.RO 2026-04 unverdicted novelty 5.0

StableIDM stabilizes inverse dynamics models under manipulator truncation by combining robot-centric masking, directional spatial feature aggregation, and temporal dynamics refinement, yielding 12.1% higher strict act...
Controllable Video Object Insertion via Multiview Priors
cs.CV 2026-04 unverdicted novelty 5.0

A multi-view prior-based framework for video object insertion that uses dual-path conditioning and an integration-aware consistency module to improve appearance stability and occlusion handling.
Ride the Wave: Precision-Allocated Sparse Attention for Smooth Video Generation
cs.CV 2026-04 unverdicted novelty 5.0

PASA uses curvature-aware dynamic budgeting, grouped approximations, and stochastic attention routing to accelerate video diffusion transformers while eliminating temporal flickering from sparse patterns.
Not all tokens contribute equally to diffusion learning
cs.CV 2026-04 unverdicted novelty 5.0

DARE mitigates neglect of important tokens in conditional diffusion models via distribution-rectified guidance and spatial attention alignment.
Open-Sora: Democratizing Efficient Video Production for All
cs.CV 2024-12 unverdicted novelty 5.0

Open-Sora releases an open-source video generation model based on a Spatial-Temporal Diffusion Transformer that decouples spatial and temporal attention, supporting text-to-video, image-to-video, and text-to-image tas...
Movie Gen: A Cast of Media Foundation Models
cs.CV 2024-10 unverdicted novelty 5.0

A 30B-parameter transformer and related models generate high-quality videos and audio, claiming state-of-the-art results on text-to-video, video editing, personalization, and audio generation tasks.
World Action Models: The Next Frontier in Embodied AI
cs.RO 2026-05 unverdicted novelty 4.0

The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.
ModelScope Text-to-Video Technical Report
cs.CV 2023-08 unverdicted novelty 4.0

ModelScopeT2V is a 1.7-billion-parameter text-to-video model built on Stable Diffusion that adds temporal modeling and outperforms prior methods on three evaluation metrics.
Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models
cs.CV 2024-02 unverdicted novelty 2.0

The paper reviews the background, technology, applications, limitations, and future directions of OpenAI's Sora text-to-video generative model based on public information.

Reference graph

Works this paper leans on

37 extracted references · 37 canonical work pages · cited by 40 Pith papers · 8 internal anchors

[1]

T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al. Language models are few-shot learners. arXiv preprint arXiv:2005.14165, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2005
[2]

Carreira and A

J. Carreira and A. Zisserman. Quo vadis, action recognition? a new model and the kinetics dataset. In proceedings of the IEEE Conference on Computer Vision and Pattern Recognition , pages 6299–6308, 2017

work page 2017
[3]

7, 13, 16

J. Carreira, E. Noland, A. Banki-Horvath, C. Hillier, and A. Zisserman. A short note about kinetics-600. arXiv preprint arXiv:1808.01340, 2018

work page arXiv 2018
[4]

arXiv:1907.06571 , year=

A. Clark, J. Donahue, and K. Simonyan. Adversarial video generation on complex datasets. arXiv preprint arXiv:1907.06571, 2019

work page arXiv 1907
[5]

M. Ding, Z. Yang, W. Hong, W. Zheng, C. Zhou, D. Yin, J. Lin, X. Zou, Z. Shao, H. Yang, et al. Cogview: Mastering text-to-image generation via transformers. Advances in Neural Information Processing Systems, 34, 2021

work page 2021
[6]

M. Ding, W. Zheng, W. Hong, and J. Tang. Cogview2: Faster and better text-to-image generation via hierarchical transformers. arXiv preprint arXiv:2204.14217, 2022

work page arXiv 2022
[7]

Esser, R

P. Esser, R. Rombach, and B. Ommer. Taming transformers for high-resolution image synthesis. arXiv preprint arXiv:2012.09841, 2020

work page arXiv 2012
[8]

C. Finn, I. Goodfellow, and S. Levine. Unsupervised learning for physical interaction through video prediction. Advances in neural information processing systems , 29, 2016

work page 2016
[9]

S. Ge, T. Hayes, H. Yang, X. Yin, G. Pang, D. Jacobs, J.-B. Huang, and D. Parikh. Long video generation with time-agnostic vqgan and time-sensitive transformer. arXiv preprint arXiv:2204.03638, 2022

work page arXiv 2022
[10]

I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y . Bengio. Generative adversarial networks.arXiv preprint arXiv:1406.2661, 2014

work page internal anchor Pith review arXiv 2014
[11]

J. Ho, T. Salimans, A. Gritsenko, W. Chan, M. Norouzi, and D. J. Fleet. Video diffusion models. arXiv preprint arXiv:2204.03458, 2022

work page internal anchor Pith review arXiv 2022
[12]

Karpathy, G

A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale video classiﬁcation with convolutional neural networks. InProceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 1725–1732, 2014

work page 2014
[13]

J. Lin, C. Gan, and S. Han. Tsm: Temporal shift module for efﬁcient video understanding. In Proceedings of the IEEE/CVF International Conference on Computer Vision , pages 7083–7093, 2019

work page 2019
[14]

Z. Liu, Y . Lin, Y . Cao, H. Hu, Y . Wei, Z. Zhang, S. Lin, and B. Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10012–10022, 2021

work page 2021
[15]

P. Luc, A. Clark, S. Dieleman, D. d. L. Casas, Y . Doron, A. Cassirer, and K. Simonyan. Transformation-based adversarial video prediction on large-scale data. arXiv preprint arXiv:2003.04035, 2020

work page arXiv 2003
[16]

Miech, D

A. Miech, D. Zhukov, J.-B. Alayrac, M. Tapaswi, I. Laptev, and J. Sivic. Howto100m: Learning a text-video embedding by watching hundred million narrated video clips. In Proceedings of the IEEE/CVF International Conference on Computer Vision , pages 2630–2640, 2019. 10

work page 2019
[17]

Latent video transformer

R. Rakhimov, D. V olkhonskiy, A. Artemov, D. Zorin, and E. Burnaev. Latent video transformer. arXiv preprint arXiv:2006.10704, 2020

work page arXiv 2006
[18]

Zero-Shot Text-to-Image Generation

A. Ramesh, M. Pavlov, G. Goh, S. Gray, C. V oss, A. Radford, M. Chen, and I. Sutskever. Zero-shot text-to-image generation. arXiv preprint arXiv:2102.12092, 2021

work page internal anchor Pith review arXiv 2021
[19]

Saito, E

M. Saito, E. Matsumoto, and S. Saito. Temporal generative adversarial nets with singular value clipping. In Proceedings of the IEEE international conference on computer vision , pages 2830–2839, 2017

work page 2017
[20]

Saito, S

M. Saito, S. Saito, M. Koyama, and S. Kobayashi. Train sparsely, generate densely: Memory- efﬁcient unsupervised training of high-resolution temporal gan. International Journal of Computer Vision, 128(10):2586–2606, 2020

work page 2020
[21]

Salimans, I

T. Salimans, I. Goodfellow, W. Zaremba, V . Cheung, A. Radford, and X. Chen. Improved techniques for training gans. In Proceedings of the 30th International Conference on Neural Information Processing Systems, pages 2234–2242, 2016

work page 2016
[22]

UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild

K. Soomro, A. R. Zamir, and M. Shah. Ucf101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012

work page internal anchor Pith review Pith/arXiv arXiv 2012
[23]

Sutskever, J

I. Sutskever, J. Martens, and G. Hinton. Generating text with recurrent neural networks. In ICML’11, page 1017–1024, 2011

work page 2011
[24]

Y . Tian, J. Ren, M. Chai, K. Olszewski, X. Peng, D. N. Metaxas, and S. Tulyakov. A good image generator is what you need for high-resolution video synthesis.arXiv preprint arXiv:2104.15069, 2021

work page arXiv 2021
[25]

D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning spatiotemporal features with 3d convolutional networks. In Proceedings of the IEEE international conference on computer vision, pages 4489–4497, 2015

work page 2015
[26]

Tulyakov, M.-Y

S. Tulyakov, M.-Y . Liu, X. Yang, and J. Kautz. Mocogan: Decomposing motion and content for video generation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1526–1535, 2018

work page 2018
[27]

Towards Accurate Generative Models of Video: A New Metric & Challenges

T. Unterthiner, S. van Steenkiste, K. Kurach, R. Marinier, M. Michalski, and S. Gelly. To- wards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:1812.01717, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[28]

van den Oord, O

A. van den Oord, O. Vinyals, and K. Kavukcuoglu. Neural discrete representation learning. In Proceedings of the 31st International Conference on Neural Information Processing Systems , pages 6309–6318, 2017

work page 2017
[29]

Attention Is All You Need

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. arXiv preprint arXiv:1706.03762, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[30]

V ondrick, H

C. V ondrick, H. Pirsiavash, and A. Torralba. Generating videos with scene dynamics.Advances in neural information processing systems , 29, 2016

work page 2016
[31]

X. Wang, J. Wu, J. Chen, L. Li, Y .-F. Wang, and W. Y . Wang. Vatex: A large-scale, high- quality multilingual dataset for video-and-language research. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4581–4591, 2019

work page 2019
[32]

Y . Wang, M. Long, J. Wang, Z. Gao, and P. S. Yu. Predrnn: Recurrent neural networks for predictive learning using spatiotemporal lstms. Advances in neural information processing systems, 30, 2017

work page 2017
[33]

Scaling autoregressive video models

D. Weissenborn, O. Täckström, and J. Uszkoreit. Scaling autoregressive video models. arXiv preprint arXiv:1906.02634, 2019

work page arXiv 1906
[34]

C. Wu, L. Huang, Q. Zhang, B. Li, L. Ji, F. Yang, G. Sapiro, and N. Duan. Godiva: Generating open-domain videos from natural descriptions. arXiv preprint arXiv:2104.14806, 2021. 11

work page arXiv 2021
[35]

C. Wu, J. Liang, L. Ji, F. Yang, Y . Fang, D. Jiang, and N. Duan. N\" uwa: Visual synthesis pre-training for neural visual world creation. arXiv preprint arXiv:2111.12417, 2021

work page arXiv 2021
[36]

W. Yan, Y . Zhang, P. Abbeel, and A. Srinivas. Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:2104.10157, 2021

work page internal anchor Pith review arXiv 2021
[37]

S. Yu, J. Tack, S. Mo, H. Kim, J. Kim, J.-W. Ha, and J. Shin. Generating videos with dynamics- aware implicit generative adversarial networks. arXiv preprint arXiv:2202.10571, 2022. A Attention Analysis To explore the attention mechanism of dual-channel attention, we visualize (1) the attention distribu- tion in the temporal channel and (2) the mixture fa...

work page arXiv 2022