AnyFlow enables any-step video diffusion by distilling flow-map transitions over arbitrary time intervals with on-policy backward simulation.
super hub Canonical reference
CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer
Canonical reference. 76% of citing Pith papers cite this work as background.
abstract
We present CogVideoX, a large-scale text-to-video generation model based on diffusion transformer, which can generate 10-second continuous videos aligned with text prompt, with a frame rate of 16 fps and resolution of 768 * 1360 pixels. Previous video generation models often had limited movement and short durations, and is difficult to generate videos with coherent narratives based on text. We propose several designs to address these issues. First, we propose a 3D Variational Autoencoder (VAE) to compress videos along both spatial and temporal dimensions, to improve both compression rate and video fidelity. Second, to improve the text-video alignment, we propose an expert transformer with the expert adaptive LayerNorm to facilitate the deep fusion between the two modalities. Third, by employing a progressive training and multi-resolution frame pack technique, CogVideoX is adept at producing coherent, long-duration, different shape videos characterized by significant motions. In addition, we develop an effective text-video data processing pipeline that includes various data preprocessing strategies and a video captioning method, greatly contributing to the generation quality and semantic alignment. Results show that CogVideoX demonstrates state-of-the-art performance across both multiple machine metrics and human evaluations. The model weight of both 3D Causal VAE, Video caption model and CogVideoX are publicly available at https://github.com/THUDM/CogVideo.
hub tools
citation-role summary
citation-polarity summary
claims ledger
- abstract We present CogVideoX, a large-scale text-to-video generation model based on diffusion transformer, which can generate 10-second continuous videos aligned with text prompt, with a frame rate of 16 fps and resolution of 768 * 1360 pixels. Previous video generation models often had limited movement and short durations, and is difficult to generate videos with coherent narratives based on text. We propose several designs to address these issues. First, we propose a 3D Variational Autoencoder (VAE) to compress videos along both spatial and temporal dimensions, to improve both compression rate and v
authors
co-cited works
representative citing papers
TrackCraft3R is the first method to repurpose a video diffusion transformer as a feed-forward dense 3D tracker via dual-latent representations and temporal RoPE alignment, achieving SOTA performance with lower compute.
SafeGen-Bench is a benchmark with 10 malicious categories that evaluates conditional T2V models on paired start frames and text prompts, finding unsafety scores up to 44.5 and 80% guardrail failure rate.
MBench is a new benchmark that quantifies long-term memory in video world models via three hierarchical consistency dimensions evaluated on curated real videos.
C4G introduces compact timestamp-conditioned Gaussian query tokens that aggregate full temporal context to decode 3D Gaussians with timestamp-modulated positions for feed-forward 4D reconstruction from monocular video, plus a diffusion-based rendering module and extension to 4D feature fields.
Presents Decoupled Time Guidance (DTG) for training-free generative video super-resolution by temporally decoupling conditional and unconditional diffusion signals.
YoCausal benchmark shows video diffusion models detect the arrow of time but lack genuine causal understanding relative to humans.
Geo-Align applies RL with a perceptual reward derived from 3D camera trajectory estimation to improve controllability and fidelity in video generation without paired training data.
CRONOS benchmark shows recent open-source video generators fail to preserve physical consistency under controlled changes to viewpoint, scene, object category, and appearance.
CoMoGen generates controllable interactive video from mask sequences and images by encoding masks into MMDiT via MaskAdapter and LoRA on motion layers, claiming SOTA motion fidelity.
ORBIS uses output-guided token reduction and DATM to achieve 2x higher token reduction than AsymRnR, with up to 4.5x speedup and 79.3% energy savings versus A100 GPU for video DiT models.
PREX decomposes target 4D video volumes into Preserve, Reveal, and Expand roles with a region-aware adapter on a frozen diffusion backbone, trained via proxy tasks, and introduces the PREBench benchmark to reduce region-structured editing failures.
Aero-World adapts a pretrained latent diffusion transformer for action-conditioned aerial video generation by injecting inertial action tokens and using a frozen latent-space Physics Probe for inertial consistency supervision during LoRA finetuning, with a new AeroBench benchmark showing improved AA
StreamingEffect enables real-time 720p human-centric video effect generation on one GPU via teacher-student distillation, keyframe control, and a new 130K video dataset.
Echo-Forcing decouples stable anchors, compressed history, and recent dynamics in video diffusion KV caches using hierarchical memory, scene recall frames, and difference-aware decay to support interactive long video generation under bounded cache.
WorldVLN proposes the first autoregressive world action model for aerial vision-language navigation that predicts short-horizon latent world states, decodes them to waypoints in closed loop, and uses two-stage training with Action-aware GRPO to achieve over 12% success-rate gains on benchmarks plus零
EntityBench is a new benchmark with detailed per-shot entity schedules from real media, and the EntityMem baseline using persistent per-entity memory achieves the highest character fidelity with Cohen's d of +2.33.
HASTE delivers up to 1.93x speedup on Wan2.1 video DiTs via head-wise adaptive sparse attention using temporal mask reuse and error-guided per-head calibration while preserving video quality.
TeDiO regularizes temporal diagonals in diffusion transformer attention maps to produce smoother video motion while keeping per-frame quality intact.
R-DMesh generates high-fidelity 4D meshes aligned to video by disentangling base mesh, motion, and a learned rectification jump offset inside a VAE, then using Triflow Attention and rectified-flow diffusion.
GTA generates 3D worlds from single images via a two-stage video diffusion process that prioritizes geometry before appearance to improve structural consistency.
MoCam unifies static and dynamic novel view synthesis by temporally decoupling geometric alignment and appearance refinement within the diffusion denoising process.
A kinematic-to-visual lifting paradigm combined with hierarchically routed control generates action-conditioned surgical videos with better faithfulness, fidelity, and efficiency.
ACWM-Phys is a controllable simulator benchmark with in- and out-of-distribution protocols for evaluating action-conditioned world models across rigid, kinematic, deformable, and particle dynamics.
citing papers explorer
-
AnyFlow: Any-Step Video Diffusion Model with On-Policy Flow Map Distillation
AnyFlow enables any-step video diffusion by distilling flow-map transitions over arbitrary time intervals with on-policy backward simulation.
-
TrackCraft3R: Repurposing Video Diffusion Transformers for Dense 3D Tracking
TrackCraft3R is the first method to repurpose a video diffusion transformer as a feed-forward dense 3D tracker via dual-latent representations and temporal RoPE alignment, achieving SOTA performance with lower compute.
-
SafeGen-Bench: Benchmarking Safety in Image-Conditioned Text-to-Video Generation
SafeGen-Bench is a benchmark with 10 malicious categories that evaluates conditional T2V models on paired start frames and text prompts, finding unsafety scores up to 44.5 and 80% guardrail failure rate.
-
MBench: A Comprehensive Benchmark on Memory Capability for Video World Models
MBench is a new benchmark that quantifies long-term memory in video world models via three hierarchical consistency dimensions evaluated on curated real videos.
-
Learning Global Motion with Compact Gaussians for Feed-Forward 4D Reconstruction
C4G introduces compact timestamp-conditioned Gaussian query tokens that aggregate full temporal context to decode 3D Gaussians with timestamp-modulated positions for feed-forward 4D reconstruction from monocular video, plus a diffusion-based rendering module and extension to 4D feature fields.
-
DTG-Restore: Training-Free Diffusion Refinement for Generative Video Super-Resolution
Presents Decoupled Time Guidance (DTG) for training-free generative video super-resolution by temporally decoupling conditional and unconditional diffusion signals.
-
YoCausal: How Far is Video Generation from World Model? A Causality Perspective
YoCausal benchmark shows video diffusion models detect the arrow of time but lack genuine causal understanding relative to humans.
-
Geo-Align: Video Generation Alignment via Metric Geometry Reward
Geo-Align applies RL with a perceptual reward derived from 3D camera trajectory estimation to improve controllability and fidelity in video generation without paired training data.
-
CRONOS: Benchmarking Counterfactual Physical Consistency in Video Models
CRONOS benchmark shows recent open-source video generators fail to preserve physical consistency under controlled changes to viewpoint, scene, object category, and appearance.
-
CoMoGen: COntrollable MOtion Dynamics and Interactions with Mask-Guided Video GENeration
CoMoGen generates controllable interactive video from mask sequences and images by encoding masks into MMDiT via MaskAdapter and LoRA on motion layers, claiming SOTA motion fidelity.
-
ORBIS: Output-Guided Token Reduction with Distribution-Aware Matching for Video Diffusion Acceleration
ORBIS uses output-guided token reduction and DATM to achieve 2x higher token reduction than AsymRnR, with up to 4.5x speedup and 79.3% energy savings versus A100 GPU for video DiT models.
-
Preserve, Reveal, Expand: Faithful 4D Video Editing with Region-Aware Conditioning
PREX decomposes target 4D video volumes into Preserve, Reveal, and Expand roles with a region-aware adapter on a frozen diffusion backbone, trained via proxy tasks, and introduces the PREBench benchmark to reduce region-structured editing failures.
-
Aero-World: Action-Conditioned Aerial Video Generation from Inertial Controls
Aero-World adapts a pretrained latent diffusion transformer for action-conditioned aerial video generation by injecting inertial action tokens and using a frozen latent-space Physics Probe for inertial consistency supervision during LoRA finetuning, with a new AeroBench benchmark showing improved AA
-
StreamingEffect: Real-Time Human-Centric Video Effect Generation
StreamingEffect enables real-time 720p human-centric video effect generation on one GPU via teacher-student distillation, keyframe control, and a new 130K video dataset.
-
Echo-Forcing: A Scene Memory Framework for Interactive Long Video Generation
Echo-Forcing decouples stable anchors, compressed history, and recent dynamics in video diffusion KV caches using hierarchical memory, scene recall frames, and difference-aware decay to support interactive long video generation under bounded cache.
-
WorldVLN: Autoregressive World Action Model for Aerial Vision-Language Navigation
WorldVLN proposes the first autoregressive world action model for aerial vision-language navigation that predicts short-horizon latent world states, decodes them to waypoints in closed loop, and uses two-stage training with Action-aware GRPO to achieve over 12% success-rate gains on benchmarks plus零
-
EntityBench: Towards Entity-Consistent Long-Range Multi-Shot Video Generation
EntityBench is a new benchmark with detailed per-shot entity schedules from real media, and the EntityMem baseline using persistent per-entity memory achieves the highest character fidelity with Cohen's d of +2.33.
-
HASTE: Training-Free Video Diffusion Acceleration via Head-Wise Adaptive Sparse Attention
HASTE delivers up to 1.93x speedup on Wan2.1 video DiTs via head-wise adaptive sparse attention using temporal mask reuse and error-guided per-head calibration while preserving video quality.
-
TeDiO: Temporal Diagonal Optimization for Training-Free Coherent Video Diffusion
TeDiO regularizes temporal diagonals in diffusion transformer attention maps to produce smoother video motion while keeping per-frame quality intact.
-
R-DMesh: Video-Guided 3D Animation via Rectified Dynamic Mesh Flow
R-DMesh generates high-fidelity 4D meshes aligned to video by disentangling base mesh, motion, and a learned rectification jump offset inside a VAE, then using Triflow Attention and rectified-flow diffusion.
-
GTA: Advancing Image-to-3D World Generation via Geometry Then Appearance Video Diffusion
GTA generates 3D worlds from single images via a two-stage video diffusion process that prioritizes geometry before appearance to improve structural consistency.
-
MoCam: Unified Novel View Synthesis via Structured Denoising Dynamics
MoCam unifies static and dynamic novel view synthesis by temporally decoupling geometric alignment and appearance refinement within the diffusion denoising process.
-
From Articulated Kinematics to Routed Visual Control for Action-Conditioned Surgical Video Generation
A kinematic-to-visual lifting paradigm combined with hierarchically routed control generates action-conditioned surgical videos with better faithfulness, fidelity, and efficiency.
-
ACWM-Phys: Investigating Generalized Physical Interaction in Action-Conditioned Video World Models
ACWM-Phys is a controllable simulator benchmark with in- and out-of-distribution protocols for evaluating action-conditioned world models across rigid, kinematic, deformable, and particle dynamics.
-
OphEdit: Training-Free Text-Guided Editing of Ophthalmic Surgical Videos
OphEdit enables text-guided editing of eye surgery videos without training by injecting preserved attention value tensors into the diffusion denoising process to maintain anatomical structure.
-
DCR: Counterfactual Attractor Guidance for Rare Compositional Generation
DCR uses a counterfactual attractor and projection-based repulsion to suppress default completion bias in diffusion models, improving fidelity for rare compositional prompts while preserving quality.
-
A unified Benchmark for Multi-Frame Image Restoration under Severe Refractive Warping
Presents the first large-scale benchmark for multi-frame geometric distortion removal in videos under severe refractive warping, using real and synthetic data across four distortion levels and evaluating classical and learning-based methods including a proposed diffusion-based V-cache.
-
Stream-R1: Reliability-Perplexity Aware Reward Distillation for Streaming Video Generation
Stream-R1 improves distillation of autoregressive streaming video diffusion models by adaptively weighting supervision with a reward model at both rollout and per-pixel levels.
-
AniMatrix: An Anime Video Generation Model that Thinks in Art, Not Physics
AniMatrix generates anime videos by structuring artistic production rules into a controllable taxonomy and training the model to prioritize those rules over physical realism, achieving top scores from professional animators on prompt understanding and artistic motion.
-
Being-H0.7: A Latent World-Action Model from Egocentric Videos
Being-H0.7 adds future-aware latent reasoning to direct VLA policies via dual-branch alignment on latent queries, matching world-model benefits at VLA efficiency.
-
YOSE: You Only Select Essential Tokens for Efficient DiT-based Video Object Removal
YOSE accelerates DiT video object removal up to 2.5x by using BVI for adaptive token selection and DiffSim to simulate unmasked token effects, while preserving visual quality.
-
OmniShotCut: Holistic Relational Shot Boundary Detection with Shot-Query Transformer
OmniShotCut treats shot boundary detection as structured relational prediction via a shot-query Transformer, uses fully synthetic transitions for training data, and releases OmniShotCutBench for evaluation.
-
$Z^2$-Sampling: Zero-Cost Zigzag Trajectories for Semantic Alignment in Diffusion Models
Z²-Sampling implicitly realizes zero-cost zigzag trajectories for curvature-aware semantic alignment in diffusion models by reducing multi-step paths via operator dualities and temporal caching while synthesizing a directional derivative penalty.
-
Latent Space Probing for Adult Content Detection in Video Generative Models
Latent space probing on CogVideoX achieves 97.29% F1 for adult content detection on a new 11k-clip dataset with 4-6ms overhead.
-
VistaBot: View-Robust Robot Manipulation via Spatiotemporal-Aware View Synthesis
VistaBot integrates 4D geometry estimation and spatiotemporal view synthesis into action policies to improve cross-view generalization by 2.6-2.8x on a new VGS metric in simulation and real tasks.
-
Reshoot-Anything: A Self-Supervised Model for In-the-Wild Video Reshooting
Reshoot-Anything trains a diffusion transformer on pseudo multi-view triplets created by cropping and warping monocular videos to achieve temporally consistent video reshooting with robust camera control on dynamic scenes.
-
DeVI: Physics-based Dexterous Human-Object Interaction via Synthetic Video Imitation
DeVI enables zero-shot physically plausible dexterous control by imitating synthetic videos via a hybrid 3D-human plus 2D-object tracking reward.
-
AttentionBender: Manipulating Cross-Attention in Video Diffusion Transformers as a Creative Probe
AttentionBender applies 2D transforms to cross-attention maps in video diffusion transformers, producing distributed distortions and glitch aesthetics that reveal entangled attention mechanisms while serving as both an XAI probe and creative tool.
-
HumanScore: Benchmarking Human Motions in Generated Videos
HumanScore defines six metrics for kinematic plausibility, temporal stability, and biomechanical consistency to benchmark human motions in videos from thirteen state-of-the-art generation models, revealing gaps between visual appeal and physical fidelity.
-
ReImagine: Rethinking Controllable High-Quality Human Video Generation via Image-First Synthesis
ReImagine decouples human appearance from temporal consistency via pretrained image backbones, SMPL-X motion guidance, and training-free video diffusion refinement to generate high-quality controllable videos.
-
ViPS: Video-informed Pose Spaces for Auto-Rigged Meshes
ViPS learns a universal, controllable pose space for auto-rigged meshes by transferring motion priors from video diffusion models, matching SOTA performance on plausibility and diversity while enabling zero-shot generalization.
-
DreamStereo: Towards Real-Time Stereo Inpainting for HD Videos
DreamStereo uses GAPW, PBDP, and SASI to enable real-time stereo video inpainting at 25 FPS for HD videos by reducing over 70% redundant computation while maintaining quality.
-
LottieGPT: Tokenizing Vector Animation for Autoregressive Generation
LottieGPT tokenizes Lottie animations into compact sequences and fine-tunes Qwen-VL to autoregressively generate coherent vector animations from natural language or visual prompts, outperforming prior SVG models.
-
Any 3D Scene is Worth 1K Tokens: 3D-Grounded Representation for Scene Generation at Scale
A 3D-grounded autoencoder and diffusion transformer allow direct generation of 3D scenes in an implicit latent space using a fixed 1K-token representation for arbitrary views and resolutions.
-
Prompt Relay: Inference-Time Temporal Control for Multi-Event Video Generation
Prompt Relay is an inference-time plug-and-play method that penalizes cross-attention to enforce temporal prompt alignment and reduce semantic entanglement in multi-event video generation.
-
CT-1: Vision-Language-Camera Models Transfer Spatial Reasoning Knowledge to Camera-Controllable Video Generation
CT-1 transfers spatial reasoning from vision-language models to estimate camera trajectories, which are then used in a video diffusion model with wavelet regularization to produce controllable videos, claiming 25.7% better accuracy than prior methods.
-
Novel View Synthesis as Video Completion
Video diffusion models can be adapted into permutation-invariant generators for sparse novel view synthesis by treating the problem as video completion and removing temporal order cues.
-
DiV-INR: Extreme Low-Bitrate Diffusion Video Compression with INR Conditioning
DiV-INR integrates implicit neural representations as conditioning signals for diffusion models to achieve better perceptual quality than HEVC, VVC, and prior neural codecs at extremely low bitrates under 0.05 bpp.
-
ViVa: A Video-Generative Value Model for Robot Reinforcement Learning
ViVa turns a video generator into a value model for robot RL that jointly forecasts future states and task value, yielding better performance on real-world box assembly when integrated with RECAP.
-
MoRight: Motion Control Done Right
MoRight disentangles object and camera motion via canonical-view specification and temporal cross-view attention, while decomposing motion into active user-driven and passive consequence components to learn and apply causality in video generation.