super hub Canonical reference

CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

Jiayan Teng, Jiazheng Xu, Ming Ding, Shiyu Huang, Wendi Zheng, Zhuoyi Yang · 2024 · cs.CV · arXiv 2408.06072

Canonical reference. 76% of citing Pith papers cite this work as background.

293 Pith papers citing it

Background 76% of classified citations

open full Pith review browse 293 citing papers more from Jiayan Teng arXiv PDF

abstract

We present CogVideoX, a large-scale text-to-video generation model based on diffusion transformer, which can generate 10-second continuous videos aligned with text prompt, with a frame rate of 16 fps and resolution of 768 * 1360 pixels. Previous video generation models often had limited movement and short durations, and is difficult to generate videos with coherent narratives based on text. We propose several designs to address these issues. First, we propose a 3D Variational Autoencoder (VAE) to compress videos along both spatial and temporal dimensions, to improve both compression rate and video fidelity. Second, to improve the text-video alignment, we propose an expert transformer with the expert adaptive LayerNorm to facilitate the deep fusion between the two modalities. Third, by employing a progressive training and multi-resolution frame pack technique, CogVideoX is adept at producing coherent, long-duration, different shape videos characterized by significant motions. In addition, we develop an effective text-video data processing pipeline that includes various data preprocessing strategies and a video captioning method, greatly contributing to the generation quality and semantic alignment. Results show that CogVideoX demonstrates state-of-the-art performance across both multiple machine metrics and human evaluations. The model weight of both 3D Causal VAE, Video caption model and CogVideoX are publicly available at https://github.com/THUDM/CogVideo.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 61 method 9 baseline 7 dataset 1

citation-polarity summary

background 59 use method 9 baseline 7 unclear 2 use dataset 1

claims ledger

abstract We present CogVideoX, a large-scale text-to-video generation model based on diffusion transformer, which can generate 10-second continuous videos aligned with text prompt, with a frame rate of 16 fps and resolution of 768 * 1360 pixels. Previous video generation models often had limited movement and short durations, and is difficult to generate videos with coherent narratives based on text. We propose several designs to address these issues. First, we propose a 3D Variational Autoencoder (VAE) to compress videos along both spatial and temporal dimensions, to improve both compression rate and v

authors

Jiayan Teng Jiazheng Xu Ming Ding Shiyu Huang Wendi Zheng Zhuoyi Yang

co-cited works

representative citing papers

AnyFlow: Any-Step Video Diffusion Model with On-Policy Flow Map Distillation

cs.CV · 2026-05-13 · unverdicted · novelty 8.0

AnyFlow enables any-step video diffusion by distilling flow-map transitions over arbitrary time intervals with on-policy backward simulation.

TrackCraft3R: Repurposing Video Diffusion Transformers for Dense 3D Tracking

cs.CV · 2026-05-12 · unverdicted · novelty 8.0

TrackCraft3R is the first method to repurpose a video diffusion transformer as a feed-forward dense 3D tracker via dual-latent representations and temporal RoPE alignment, achieving SOTA performance with lower compute.

Towards Memory-Efficient Autoregressive Video Generation via Instance-Specific Parametric Absorption

cs.CV · 2026-07-01 · unverdicted · novelty 7.0

ISPA reduces KV cache size by up to 50% in AR video models by transitioning layers to local attention and applying instance-specific least-squares weight modulation to compensate for lost history.

DVG-WM: Disentangled Video Generation Enables Efficient Embodied World Model for Robotic Manipulation

cs.RO · 2026-06-30 · unverdicted · novelty 7.0

DVG-WM disentangles dynamics learning and visual synthesis in video world models using flow matching and latent degradation to achieve faster inference up to 3.97 times with improved quality on LIBERO and real-world robotic platforms.

MemLearner: Learning to Query Context memory for Video World Models

cs.CV · 2026-06-30 · unverdicted · novelty 7.0

MemLearner introduces a learning-based adaptive context query method using query tokens in video world models to improve long-term scene consistency over rule-based retrieval.

OmniDance: Multimodal Driven Dance Video Generation with Large-scale Internet Data

cs.CV · 2026-06-29 · unverdicted · novelty 7.0

Introduces CIPE-Dance as the largest dance video dataset and OmniDance framework for unified text-music multimodal dance video generation achieving SOTA on TI2V, MI2V, and MTI2V tasks.

Activation Steering of Video Generation Models via Reduced-Order Linear Optimal Control

cs.LG · 2026-06-03 · unverdicted · novelty 7.0

LA-LQR applies latent-space linear-quadratic regulator control to steer text-to-video model activations toward desired features while penalizing excessive changes.

Ultra-Fast Neural Video Compression

cs.CV · 2026-06-03 · unverdicted · novelty 7.0

DCVC-UF uses chunk-based joint encoding and parallel frame-specific decoding to deliver ultra-fast neural video compression while claiming new state-of-the-art rate-distortion performance.

Diffusing in the Right Space: A Systematic Study of Latent Diffusability

cs.CV · 2026-06-02 · unverdicted · novelty 7.0

A large-scale empirical study across tokenizers and diffusion backbones identifies Velocity Irreducible Variance (VIV) as one of the most stable predictors of latent diffusion generation quality.

From Zero to Hero: Training-Free Custom Concept Spawning in World Models

cs.CV · 2026-06-01 · unverdicted · novelty 7.0

SPAWN enables training-free insertion of custom visual concepts into autoregressive world models by swapping the pinned context-memory anchor over a short injection window.

VLMs are Good Teachers for Video Reasoning via Adaptive Test-Time Optimization

cs.CV · 2026-06-01 · unverdicted · novelty 7.0 · 2 refs

VLMs formulate differentiable rewards from task-specific rules to enable test-time online LoRA optimization of VGMs, delivering 16.7-point gains on symbolic and general video reasoning benchmarks over VLM-as-solver and Best-of-N baselines.

LongLive-RAG: A General Retrieval-Augmented Framework for Long Video Generation

cs.CV · 2026-06-01 · unverdicted · novelty 7.0

LongLive-RAG formulates long video generation as retrieval-augmented generation by treating self-generated latents as a dynamic searchable history and adding a Window Temporal Delta Loss for better retrieval.

SafeGen-Bench: Benchmarking Safety in Image-Conditioned Text-to-Video Generation

cs.CV · 2026-05-31 · unverdicted · novelty 7.0

SafeGen-Bench is a benchmark with 10 malicious categories that evaluates conditional T2V models on paired start frames and text prompts, finding unsafety scores up to 44.5 and 80% guardrail failure rate.

MBench: A Comprehensive Benchmark on Memory Capability for Video World Models

cs.CV · 2026-05-30 · unverdicted · novelty 7.0

MBench is a new benchmark that quantifies long-term memory in video world models via three hierarchical consistency dimensions evaluated on curated real videos.

Learning Global Motion with Compact Gaussians for Feed-Forward 4D Reconstruction

cs.CV · 2026-05-29 · unverdicted · novelty 7.0

C4G introduces compact timestamp-conditioned Gaussian query tokens that aggregate full temporal context to decode 3D Gaussians with timestamp-modulated positions for feed-forward 4D reconstruction from monocular video, plus a diffusion-based rendering module and extension to 4D feature fields.

DTG-Restore: Training-Free Diffusion Refinement for Generative Video Super-Resolution

cs.CV · 2026-05-28 · unverdicted · novelty 7.0

Presents Decoupled Time Guidance (DTG) for training-free generative video super-resolution by temporally decoupling conditional and unconditional diffusion signals.

YoCausal: How Far is Video Generation from World Model? A Causality Perspective

cs.CV · 2026-05-28 · unverdicted · novelty 7.0

YoCausal benchmark shows video diffusion models detect the arrow of time but lack genuine causal understanding relative to humans.

What-If World: A Causal Benchmark for General World Models in Embodied Scenarios

cs.CV · 2026-05-26 · unverdicted · novelty 7.0

What-If World is a new paired-prompt benchmark showing that nine state-of-the-art video generation models achieve at most 52% on causal intervention tests and cluster near 28% for open-source systems.

WBench: A Comprehensive Multi-turn Benchmark for Interactive Video World Model Evaluation

cs.CV · 2026-05-25 · unverdicted · novelty 7.0

WBench is a benchmark with 289 test cases and 1,058 turns for evaluating interactive world models using 22 automated metrics validated against human judgments.

DeltaCam: Differential Intrinsic Camera Modeling for Video Generation

cs.CV · 2026-05-24 · unverdicted · novelty 7.0

DeltaCam models relative changes in camera intrinsics via Δ-parameterized neural adaptors in video diffusion models trained on synthetic data to enable controllable generation and real-world transfer.

World Models as Group Actions

cs.CV · 2026-05-23 · unverdicted · novelty 7.0

Formalizes video world models as group actions on states and uses latent regularization with synthesized supervision to enforce consistency, introducing GAC and GAR metrics that improve structural correctness in SOTA models.

Geo-Align: Video Generation Alignment via Metric Geometry Reward

cs.CV · 2026-05-22 · unverdicted · novelty 7.0

Geo-Align applies RL with a perceptual reward derived from 3D camera trajectory estimation to improve controllability and fidelity in video generation without paired training data.

CRONOS: Benchmarking Counterfactual Physical Consistency in Video Models

cs.CV · 2026-05-22 · unverdicted · novelty 7.0

CRONOS benchmark shows recent open-source video generators fail to preserve physical consistency under controlled changes to viewpoint, scene, object category, and appearance.

CoMoGen: COntrollable MOtion Dynamics and Interactions with Mask-Guided Video GENeration

cs.CV · 2026-05-21 · unverdicted · novelty 7.0

CoMoGen generates controllable interactive video from mask sequences and images by encoding masks into MMDiT via MaskAdapter and LoRA on motion layers, claiming SOTA motion fidelity.

citing papers explorer

Showing 50 of 249 citing papers after filters.

PhyGenHOI: Physically-Aware 4D Generation of Dynamic Human-Object Interactions cs.CV · 2026-05-28 · unverdicted · none · ref 32 · internal anchor
PhyGenHOI couples a motion diffusion model for humans with material point method simulation for objects on 3D Gaussians, using attraction loss, contact re-simulation, and masked video-SDS to produce physically consistent dynamic interactions from text.
minWM: A Full-Stack Open-Source Framework for Real-Time Interactive Video World Models cs.CV · 2026-05-28 · unverdicted · none · ref 3 · internal anchor
minWM supplies an end-to-end pipeline that fine-tunes bidirectional T2V/TI2V models with camera control then distills them via Causal Forcing into few-step autoregressive generators for low-latency rollout.
Proprio: Latent Self-Scoring and Inference-Time Refinement for Physically Plausible Video Generation cs.CV · 2026-05-27 · unverdicted · none · ref 38 · internal anchor
Proprio uses flow residuals from latent perturbations in frozen video generators as a self-scoring signal for physical plausibility, yielding reported gains of 16.5% on Physics-IQ and 20.6% on VideoPhy2-hard.
SmartDirector: Keyframe-Conditioned Cinematic Video Generation with Narrative Pacing Control cs.CV · 2026-05-27 · unverdicted · none · ref 13 · internal anchor
SmartDirector generates cinematic videos via Director-Gen for low-res keyframe-conditioned output followed by Director-SR refinement using high-res keyframes, trained on curated movie sequences.
PARE: Pruning and Adaptive Routing for Efficient Video Generation cs.CV · 2026-05-26 · unverdicted · none · ref 39 · internal anchor
PARE applies structure-aware head pruning and timestep/content-conditioned block routing to compress video DiTs, reducing per-step compute while preserving quality on Wan2.1-14B.
LaMo: Self-Supervised Latent Motion Priors for Physical Realism in Video Generation cs.CV · 2026-05-22 · unverdicted · none · ref 24 · internal anchor
LaMo adds self-supervised latent motion priors via a motion drift loss during training and motion prior guidance during sampling to boost physical fidelity in video diffusion models like CogVideoX.
SCOPE: Simulating Cross-game Operations in Playable Environments for FPS World Models cs.CV · 2026-05-22 · unverdicted · none · ref 61 · internal anchor
SCOPE adds per-pixel action conditioning to pretrained video diffusion models and releases the CrossFPS multi-game dataset to support cross-game FPS world model simulation with zero-shot transfer.
SimInsert: Seamless Video Object Insertion via Regional Sparse Attention Fusion cs.CV · 2026-05-22 · unverdicted · none · ref 13 · internal anchor
SimInsert is a training-free video object insertion technique that decouples the task into single-frame editing and semantic motion description, using image-to-video diffusion models with non-invasive guidance to achieve spatio-temporal coherence.
GEM-4D: Geometry-Enhanced Video World Models for Robot Manipulation cs.CV · 2026-05-20 · unverdicted · none · ref 64 · 2 links · internal anchor
GEM-4D improves video world models for robot manipulation by distilling 4D geometric correspondences into training and adding an inverse dynamics module, achieving SOTA geometric consistency and 81% real-world success.
Rethinking Cross-Layer Information Routing in Diffusion Transformers cs.CV · 2026-05-20 · unverdicted · none · ref 62 · 2 links · internal anchor
DAR replaces residual addition in DiTs with learnable, timestep-adaptive aggregation of sublayer outputs, yielding 2.11 FID improvement on SiT-XL/2 and 8.75x faster convergence on ImageNet 256x256.
Lance: Unified Multimodal Modeling by Multi-Task Synergy cs.CV · 2026-05-18 · unverdicted · none · ref 141 · 2 links · internal anchor
Lance presents a dual-stream mixture-of-experts model with modality-aware positional encoding and staged multi-task training that outperforms prior open-source unified models on image and video generation while keeping strong understanding performance.
GeoFlow: Enforcing Implicit Geometric Consistency in Video Generation cs.CV · 2026-05-18 · unverdicted · none · ref 86 · internal anchor
GeoFlow adds a geometry-consistency reward based on rigid camera flow and object appearance preservation, integrated via reinforcement fine-tuning to improve geometric coherence in video generation.
GeoWorld-VLM: Geometry from World Models for Vision-Language Models cs.CV · 2026-05-15 · unverdicted · none · ref 52 · 2 links · internal anchor
GeoWorld-VLM aligns VLM image features with intermediate representations from camera-conditioned world models via fine-tuning only the encoder and projector, yielding ~4% gains on What'sUp and VSR spatial benchmarks across two VLM backbones.
Flash-GRPO: Efficient Alignment for Video Diffusion via One-Step Policy Optimization cs.CV · 2026-05-15 · unverdicted · none · ref 29 · 2 links · internal anchor
Flash-GRPO is a one-step GRPO framework for video diffusion alignment that applies iso-temporal grouping and temporal gradient rectification to achieve higher alignment quality and stability than full-trajectory training under low compute budgets on 1.3B-14B models.
FashionChameleon: Towards Real-Time and Interactive Human-Garment Video Customization cs.CV · 2026-05-15 · unverdicted · none · ref 3 · 2 links · internal anchor
FashionChameleon achieves interactive multi-garment video customization at 23.8 FPS via in-context teacher models, streaming distillation, and training-free KV cache rescheduling while using only single-garment data.
Warp-as-History: Generalizable Camera-Controlled Video Generation from One Training Video cs.CV · 2026-05-14 · unverdicted · none · ref 11 · internal anchor
Warp-as-History enables zero-shot camera trajectory following in frozen video models by supplying camera-warped pseudo-history, with single-video LoRA fine-tuning improving generalization to unseen videos.
ReactiveGWM: Steering NPC in Reactive Game World Models cs.CV · 2026-05-14 · unverdicted · none · ref 44 · internal anchor
ReactiveGWM introduces a decoupled diffusion architecture for player-NPC interactions that learns game-agnostic response logic for zero-shot strategy transfer across games.
EverAnimate: Minute-Scale Human Animation via Latent Flow Restoration cs.CV · 2026-05-14 · unverdicted · none · ref 35 · internal anchor
EverAnimate restores drifted latent flow trajectories in chunked video generation via persistent latent propagation and restorative flow matching, achieving measurable gains in PSNR, SSIM, LPIPS, and FID over prior long-animation methods with only LoRA tuning.
Bridging Brain and Semantics: A Hierarchical Framework for Semantically Enhanced fMRI-to-Video Reconstruction cs.CV · 2026-05-14 · unverdicted · none · ref 114 · internal anchor
CineNeuron improves fMRI-to-video reconstruction by combining bottom-up semantic enrichment with top-down Mixture-of-Memories integration and outperforms prior methods on benchmarks.
Head Forcing: Long Autoregressive Video Generation via Head Heterogeneity cs.CV · 2026-05-14 · unverdicted · none · ref 59 · internal anchor
Head Forcing assigns tailored KV cache strategies to local, anchor, and memory attention heads plus head-wise RoPE re-encoding to extend autoregressive video generation from seconds to minutes without training.
R-DMesh: Video-Guided 3D Animation via Rectified Dynamic Mesh Flow cs.CV · 2026-05-13 · unverdicted · none · ref 93 · 2 links · internal anchor
R-DMesh proposes a VAE-based disentanglement of base mesh, motion trajectories, and rectification offset plus Triflow Attention and rectified-flow diffusion to produce 4D meshes aligned to video despite initial pose mismatch.
Pyramid Forcing: Head-Aware Pyramid KV Cache Policy for High-Quality Long Video Generation cs.CV · 2026-05-13 · unverdicted · none · ref 6 · internal anchor
Pyramid Forcing classifies attention heads into Anchor, Wave, and Veil types and applies type-specific KV cache policies to improve long-horizon autoregressive video generation quality.
HorizonDrive: Self-Corrective Autoregressive World Model for Long-horizon Driving Simulation cs.CV · 2026-05-12 · unverdicted · none · ref 26 · 2 links · internal anchor
HorizonDrive is a new anti-drifting autoregressive training and distillation method that enables minute-scale stable driving video rollouts by making the teacher model rollout-capable via scheduled rollout recovery and teacher rollout DMD.
PresentAgent-2: Towards Generalist Multimodal Presentation Agents cs.CV · 2026-05-12 · unverdicted · none · ref 22 · internal anchor
PresentAgent-2 generates query-driven multimodal presentation videos with research grounding, supporting single-speaker, multi-speaker discussion, and interactive question-answering modes.
WorldReasonBench: Human-Aligned Stress Testing of Video Generators as Future World-State Predictors cs.CV · 2026-05-11 · unverdicted · none · ref 28 · internal anchor
The paper presents WorldReasonBench, a benchmark that tests video generators on maintaining physical, social, logical, and informational consistency when predicting future states from initial conditions and actions.
SARA: Semantically Adaptive Relational Alignment for Video Diffusion Models cs.CV · 2026-05-08 · unverdicted · none · ref 15 · 2 links · internal anchor
SARA introduces semantic saliency to guide relational alignment in video diffusion models, improving text following and motion quality over prior alignment methods.
SwiftI2V: Efficient High-Resolution Image-to-Video Generation via Conditional Segment-wise Generation cs.CV · 2026-05-07 · unverdicted · none · ref 29 · 2 links · internal anchor
SwiftI2V achieves comparable 2K I2V quality to end-to-end models on VBench-I2V while cutting GPU time by 202x through low-resolution motion planning followed by strongly image-conditioned segment-wise high-resolution synthesis.
RealCam: Real-Time Novel-View Video Generation with Interactive Camera Control cs.CV · 2026-05-07 · unverdicted · none · ref 30 · internal anchor
RealCam is a causal autoregressive model for real-time camera-controlled video-to-video generation, using cross-frame in-context teacher distillation and loop-closed data augmentation to achieve high fidelity and consistency.
LIVEditor-14B: Lightning Unified Video Editing via In-Context Sparse Attention cs.CV · 2026-05-06 · unverdicted · none · ref 13 · internal anchor
LIVEditor-14B applies a new sparse attention method (ISA) that prunes context and uses query-sharpness routing to cut attention latency ~60% with no loss in editing quality on standard benchmarks.
Stream-T1: Test-Time Scaling for Streaming Video Generation cs.CV · 2026-05-06 · unverdicted · none · ref 46 · internal anchor
Stream-T1 is a test-time scaling framework for streaming video generation using scaled noise propagation from history, reward pruning across short and long windows, and feedback-guided memory sinking to improve temporal consistency and visual quality.
Motion-Aware Caching for Efficient Autoregressive Video Generation cs.CV · 2026-05-03 · conditional · none · ref 41 · 2 links · internal anchor
MotionCache accelerates autoregressive video generation up to 6.28x by motion-weighted cache reuse based on inter-frame differences, with negligible quality loss on SkyReels-V2 and MAGI-1.
Video Active Perception: Effective Inference-Time Long-Form Video Understanding with Vision-Language Models cs.CV · 2026-05-03 · unverdicted · none · ref 54 · internal anchor
VAP is a training-free active-perception method that improves zero-shot long-form video QA performance and frame efficiency up to 5.6x in VLMs by selecting keyframes that differ from priors generated by a text-conditioned video model.
UniVidX: A Unified Multimodal Framework for Versatile Video Generation via Diffusion Priors cs.CV · 2026-05-01 · unverdicted · none · ref 31 · internal anchor
UniVidX unifies diverse video generation tasks into one conditional diffusion model using stochastic condition masking, decoupled gated LoRAs, and cross-modal self-attention.
PhyCo: Learning Controllable Physical Priors for Generative Motion cs.CV · 2026-04-30 · unverdicted · none · ref 46 · internal anchor
PhyCo adds continuous physical control to video diffusion models via physics-supervised fine-tuning on a large simulation dataset and VLM-guided rewards, yielding measurable gains in physical realism on the Physics-IQ benchmark.
Delta Score Matters! Spatial Adaptive Multi Guidance in Diffusion Models cs.CV · 2026-04-29 · unverdicted · none · ref 37 · internal anchor
SAMG uses spatially adaptive guidance scales derived from a geometric analysis of classifier-free guidance to resolve the detail-artifact dilemma in diffusion-based image and video generation.
Beyond Fixed Formulas: Data-Driven Linear Predictor for Efficient Diffusion Models cs.CV · 2026-04-29 · unverdicted · none · ref 33 · internal anchor
L2P trains per-timestep linear weights on feature trajectories in about 20 seconds to enable aggressive caching in DiT models, delivering up to 4.55x FLOPs reduction with maintained visual quality.
MetaSR: Content-Adaptive Metadata Orchestration for Generative Super-Resolution cs.CV · 2026-04-29 · unverdicted · none · ref 38 · internal anchor
MetaSR adaptively orchestrates metadata in a DiT-based generative SR model to deliver up to 1 dB PSNR gains and 50% bitrate savings across diverse content and degradations.
Mutual Forcing: Dual-Mode Self-Evolution for Fast Autoregressive Audio-Video Character Generation cs.CV · 2026-04-28 · unverdicted · none · ref 55 · internal anchor
Mutual Forcing trains a single native autoregressive audio-video model with mutually reinforcing few-step and multi-step modes via self-distillation to match 50-step baselines at 4-8 steps.
HuM-Eval: A Coarse-to-Fine Framework for Human-Centric Video Evaluation cs.CV · 2026-04-28 · unverdicted · none · ref 27 · internal anchor
HuM-Eval evaluates human motion videos with a coarse-to-fine approach using VLM global checks plus 2D pose and 3D motion analysis, reaching 58.2% average correlation with human judgments and introducing a 1000-prompt benchmark.
PhysLayer: Language-Guided Layered Animation with Depth-Aware Physics cs.CV · 2026-04-26 · unverdicted · none · ref 5 · internal anchor
PhysLayer is a framework that decomposes images into depth layers, simulates physics with depth awareness, and synthesizes videos guided by language for more plausible animations.
BurstGP: Enhancing Raw Burst Image Super Resolution with Generative Priors cs.CV · 2026-04-26 · unverdicted · none · ref 71 · internal anchor
BurstGP enhances raw burst image super-resolution by integrating pretrained video diffusion priors through a multiframe-aware model, degradation-aware conditioning, and color-space conversion, outperforming prior methods on perceptual metrics.
DynamicRad: Content-Adaptive Sparse Attention for Long Video Diffusion cs.CV · 2026-04-22 · unverdicted · none · ref 12 · internal anchor
DynamicRad achieves 1.7x-2.5x inference speedups in long video diffusion with over 80% sparsity by grounding adaptive selection in a radial locality prior, using dual-mode static/dynamic strategies and offline BO with a semantic motion router.
How Far Are Video Models from True Multimodal Reasoning? cs.CV · 2026-04-21 · unverdicted · none · ref 85 · internal anchor
Current video models succeed on basic understanding but achieve under 25% success on logically grounded generation and near 0% on interactive generation, exposing gaps in multimodal reasoning.
AdaCluster: Adaptive Query-Key Clustering for Sparse Attention in Video Generation cs.CV · 2026-04-20 · unverdicted · none · ref 52 · internal anchor
AdaCluster delivers a training-free adaptive query-key clustering framework for sparse attention in video DiTs, yielding 1.67-4.31x inference speedup with negligible quality loss on CogVideoX-2B, HunyuanVideo, and Wan-2.1.
Memorize When Needed: Decoupled Memory Control for Spatially Consistent Long-Horizon Video Generation cs.CV · 2026-04-20 · unverdicted · none · ref 54 · internal anchor
A decoupled memory branch with hybrid cues, cross-attention, and gating improves spatial consistency and data efficiency in long-horizon camera-trajectory video generation.
Long-CODE: Isolating Pure Long-Context as an Orthogonal Dimension in Video Evaluation cs.CV · 2026-04-19 · unverdicted · none · ref 32 · internal anchor
Long-CODE isolates long-context video evaluation with a new benchmark dataset and shot-dynamics metric that correlates better with human judgments on narrative richness and global consistency than short-video metrics.
DreamShot: Personalized Storyboard Synthesis with Video Diffusion Prior cs.CV · 2026-04-19 · unverdicted · none · ref 53 · internal anchor
DreamShot uses video diffusion priors and a role-attention consistency loss to produce coherent, personalized storyboards with better character and scene continuity than text-to-image methods.
DiT as Real-Time Rerenderer: Streaming Video Stylization with Autoregressive Diffusion Transformer cs.CV · 2026-04-15 · unverdicted · none · ref 36 · internal anchor
RTR-DiT distills a bidirectional DiT teacher into an autoregressive few-step model using Self Forcing and Distribution Matching Distillation, plus a reference-preserving KV cache, to enable stable real-time text- and reference-guided video stylization.
VibeFlow: Versatile Video Chroma-Lux Editing through Self-Supervised Learning cs.CV · 2026-04-15 · unverdicted · none · ref 38 · internal anchor
VibeFlow performs versatile video chroma-lux editing in zero-shot fashion by self-supervised disentanglement of structure and color-illumination cues inside pre-trained video models, plus residual velocity fields and consistency regularization.
Generative Refinement Networks for Visual Synthesis cs.CV · 2026-04-14 · unverdicted · none · ref 59 · internal anchor
GRN uses hierarchical binary quantization and entropy-guided refinement to set new ImageNet records of 0.56 rFID for reconstruction and 1.81 gFID for class-conditional generation while releasing code and models.

CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

hub tools

citation-role summary

citation-polarity summary

claims ledger

authors

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer