pith. sign in

hub Canonical reference

VideoGPT: Video Generation using VQ-VAE and Transformers

Canonical reference. 79% of citing Pith papers cite this work as background.

42 Pith papers citing it
Background 79% of classified citations
abstract

We present VideoGPT: a conceptually simple architecture for scaling likelihood based generative modeling to natural videos. VideoGPT uses VQ-VAE that learns downsampled discrete latent representations of a raw video by employing 3D convolutions and axial self-attention. A simple GPT-like architecture is then used to autoregressively model the discrete latents using spatio-temporal position encodings. Despite the simplicity in formulation and ease of training, our architecture is able to generate samples competitive with state-of-the-art GAN models for video generation on the BAIR Robot dataset, and generate high fidelity natural videos from UCF-101 and Tumbler GIF Dataset (TGIF). We hope our proposed architecture serves as a reproducible reference for a minimalistic implementation of transformer based video generation models. Samples and code are available at https://wilson1yan.github.io/videogpt/index.html

hub tools

citation-role summary

background 12 baseline 2

citation-polarity summary

representative citing papers

Video Diffusion Models

cs.CV · 2022-04-07 · unverdicted · novelty 7.0

A diffusion model for video generation extends image architectures with joint image-video training and improved conditional sampling, delivering first large-scale text-to-video results and state-of-the-art performance on video prediction and unconditional generation benchmarks.

High-Resolution Image Synthesis with Latent Diffusion Models

cs.CV · 2021-12-20 · conditional · novelty 7.0

Latent diffusion models achieve state-of-the-art inpainting and competitive results on unconditional generation, scene synthesis, and super-resolution by performing the diffusion process in the latent space of pretrained autoencoders with cross-attention conditioning, while cutting computational and

Network-Efficient World Model Token Streaming

cs.RO · 2026-05-11 · unverdicted · novelty 6.0

An adaptive delta-prioritization algorithm using cosine distance and Hamming-drift thresholds improves embedding distortion by 4.8-7.2% and next-token perplexity by 2.1-6.3% over periodic keyframing at matched low bitrates for tokenized driving world models.

CASCADE: Context-Aware Relaxation for Speculative Image Decoding

cs.CV · 2026-05-08 · unverdicted · novelty 6.0

CASCADE formalizes semantic interchangeability and convergence in target model representations to enable context-aware acceptance relaxation in tree-based speculative decoding, delivering up to 3.6x speedup on text-to-image models without quality loss.

Stream-T1: Test-Time Scaling for Streaming Video Generation

cs.CV · 2026-05-06 · unverdicted · novelty 6.0

Stream-T1 is a test-time scaling framework for streaming video generation using scaled noise propagation from history, reward pruning across short and long windows, and feedback-guided memory sinking to improve temporal consistency and visual quality.

A Hybridizable Neural Time Integrator for Stable Autoregressive Forecasting

cs.LG · 2026-04-22 · unverdicted · novelty 6.0

A hybrid transformer-FEM integrator provides provable discrete energy preservation and gradient bounds for stable autoregressive forecasting of chaotic systems, with 65x fewer parameters and 9000x speedup in a fusion surrogate trained on 12 simulations.

Generative Refinement Networks for Visual Synthesis

cs.CV · 2026-04-14 · unverdicted · novelty 6.0

GRN uses hierarchical binary quantization and entropy-guided refinement to set new ImageNet records of 0.56 rFID for reconstruction and 1.81 gFID for class-conditional generation while releasing code and models.

Rolling Forcing: Autoregressive Long Video Diffusion in Real Time

cs.CV · 2025-09-29 · unverdicted · novelty 6.0

Rolling Forcing generates multi-minute videos in real time by jointly denoising frames at increasing noise levels, anchoring attention to early frames, and using windowed distillation to limit error accumulation.

ReSim: Reliable World Simulation for Autonomous Driving

cs.CV · 2025-06-11 · unverdicted · novelty 6.0

ReSim is a controllable video world model trained on heterogeneous real and simulated driving data that achieves higher fidelity and controllability for both expert and non-expert actions, plus a Video2Reward module for estimating action quality from simulated futures.

citing papers explorer

Showing 42 of 42 citing papers.