hub Canonical reference

Latent Video Diffusion Models for High-Fidelity Long Video Generation

Yingqing He, Tianyu Yang, Yong Zhang, Ying Shan, Qifeng Chen · 2022 · cs.CV · arXiv 2211.13221

Canonical reference. 92% of citing Pith papers cite this work as background.

39 Pith papers citing it

Background 92% of classified citations

open full Pith review browse 39 citing papers arXiv PDF

abstract

AI-generated content has attracted lots of attention recently, but photo-realistic video synthesis is still challenging. Although many attempts using GANs and autoregressive models have been made in this area, the visual quality and length of generated videos are far from satisfactory. Diffusion models have shown remarkable results recently but require significant computational resources. To address this, we introduce lightweight video diffusion models by leveraging a low-dimensional 3D latent space, significantly outperforming previous pixel-space video diffusion models under a limited computational budget. In addition, we propose hierarchical diffusion in the latent space such that longer videos with more than one thousand frames can be produced. To further overcome the performance degradation issue for long video generation, we propose conditional latent perturbation and unconditional guidance that effectively mitigate the accumulated errors during the extension of video length. Extensive experiments on small domain datasets of different categories suggest that our framework generates more realistic and longer videos than previous strong baselines. We additionally provide an extension to large-scale text-to-video generation to demonstrate the superiority of our work. Our code and models will be made publicly available.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 11 method 1

citation-polarity summary

background 11 use method 1

representative citing papers

TeDiO: Temporal Diagonal Optimization for Training-Free Coherent Video Diffusion

cs.CV · 2026-05-13 · unverdicted · novelty 7.0

TeDiO regularizes temporal diagonals in diffusion transformer attention maps to produce smoother video motion while keeping per-frame quality intact.

GaitProtector: Impersonation-Driven Gait De-Identification via Training-Free Diffusion Latent Optimization

cs.CV · 2026-05-12 · unverdicted · novelty 7.0

GaitProtector optimizes diffusion model latents to impersonate target identities in gait sequences, dropping Rank-1 identification accuracy from 89.6% to 15.0% on CASIA-B while keeping scoliosis diagnostic accuracy at 74.2%.

DCR: Counterfactual Attractor Guidance for Rare Compositional Generation

cs.CV · 2026-05-07 · unverdicted · novelty 7.0

DCR uses a counterfactual attractor and projection-based repulsion to suppress default completion bias in diffusion models, improving fidelity for rare compositional prompts while preserving quality.

Efficient Video Diffusion Models: Advancements and Challenges

cs.CV · 2026-04-17 · unverdicted · novelty 7.0

A survey that groups efficient video diffusion methods into four paradigms—step distillation, efficient attention, model compression, and cache/trajectory optimization—and outlines open challenges for practical use.

Not All Frames Deserve Full Computation: Accelerating Autoregressive Video Generation via Selective Computation and Predictive Extrapolation

cs.CV · 2026-04-03 · conditional · novelty 7.0

SCOPE accelerates autoregressive video diffusion up to 4.73x by using a tri-modal cache-predict-recompute scheduler with Taylor extrapolation and selective active-frame computation while preserving output quality.

ChopGrad: Pixel-Wise Losses for Latent Video Diffusion via Truncated Backpropagation

cs.CV · 2026-03-18 · unverdicted · novelty 7.0

ChopGrad truncates backpropagation to local frame windows in video diffusion models, reducing memory from linear in frame count to constant while enabling pixel-wise loss fine-tuning.

FrameDiT: Diffusion Transformer with Matrix Attention for Efficient Video Generation

cs.CV · 2026-03-10 · unverdicted · novelty 7.0

FrameDiT proposes Matrix Attention for DiTs to achieve SOTA video generation with improved temporal coherence and efficiency comparable to local factorized attention.

EduVQA: Towards Concept-Aware Assessment of Educational AI-Generated Videos

cs.CV · 2026-03-03 · unverdicted · novelty 7.0

EduVQA introduces the first concept-aware benchmark for educational AI-generated video assessment and a S2D-MoE framework that jointly evaluates perceptual quality and fine-grained semantic alignment.

One-to-All Animation: Alignment-Free Character Animation and Image Pose Transfer

cs.CV · 2025-11-28 · unverdicted · novelty 7.0

One-to-All Animation enables alignment-free character animation and image pose transfer via self-supervised outpainting reformulation, reference extraction, hybrid fusion attention, identity-robust pose control, and token replacement for long videos.

History-Guided Video Diffusion

cs.LG · 2025-02-10 · unverdicted · novelty 7.0

DFoT enables flexible history conditioning in video diffusion, with history guidance methods that boost temporal consistency and support long rollouts.

Learning Interactive Real-World Simulators

cs.AI · 2023-10-09 · conditional · novelty 7.0

UniSim learns a universal real-world simulator from orchestrated diverse datasets, enabling zero-shot deployment of policies trained purely in simulation.

RoboEvolve: Co-Evolving Planner-Simulator for Robotic Manipulation with Limited Data

cs.RO · 2026-05-13 · unverdicted · novelty 6.0

A co-evolutionary VLM-VGM loop on 500 unlabeled images raises planner success by 30 points and simulator success by 48 percent while beating fully supervised baselines.

SWIFT: Prompt-Adaptive Memory for Efficient Interactive Long Video Generation

cs.CV · 2026-05-10 · unverdicted · novelty 6.0

SWIFT introduces a semantic injection cache with head-wise updates and an adaptive dynamic window plus segment anchors to achieve efficient multi-prompt long video generation at 22.6 FPS while preserving quality in causal diffusion models.

DiffATS: Diffusion in Aligned Tensor Space

cs.LG · 2026-05-10 · unverdicted · novelty 6.0

DiffATS trains diffusion models directly on aligned Tucker tensor primitives that are proven to be homeomorphisms, delivering efficient unconditional and conditional generation across images, videos, and PDE data with high compression.

Mutual Forcing: Dual-Mode Self-Evolution for Fast Autoregressive Audio-Video Character Generation

cs.CV · 2026-04-28 · unverdicted · novelty 6.0

Mutual Forcing trains a single native autoregressive audio-video model with mutually reinforcing few-step and multi-step modes via self-distillation to match 50-step baselines at 4-8 steps.

Ego-InBetween: Generating Object State Transitions in Ego-Centric Videos

cs.CV · 2026-04-20 · unverdicted · novelty 6.0

EgoIn uses a fine-tuned vision-language model to infer transition steps and a conditioning module plus auxiliary supervision to generate coherent egocentric video sequences of object state changes.

Rein3D: Reinforced 3D Indoor Scene Generation with Panoramic Video Diffusion Models

cs.CV · 2026-04-12 · unverdicted · novelty 6.0

Rein3D generates photorealistic, globally consistent 3D indoor scenes by using a restore-and-refine process where radial panoramic videos are restored via diffusion models and then used to update a 3D Gaussian field.

Latent-Compressed Variational Autoencoder for Video Diffusion Models

cs.CV · 2026-04-12 · unverdicted · novelty 6.0

A frequency-based latent compression method for video VAEs yields higher reconstruction quality than channel-reduction baselines at fixed compression ratios.

Causal Forcing: Autoregressive Diffusion Distillation Done Right for High-Quality Real-Time Interactive Video Generation

cs.CV · 2026-02-02 · conditional · novelty 6.0 · 2 refs

Causal Forcing uses an autoregressive teacher for ODE initialization in diffusion distillation to close the causal attention gap and deliver better real-time video generation than Self Forcing.

Reward Forcing: Efficient Streaming Video Generation with Rewarded Distribution Matching Distillation

cs.CV · 2025-12-04 · conditional · novelty 6.0

Reward Forcing combines EMA-Sink tokens and Rewarded Distribution Matching Distillation to deliver state-of-the-art streaming video generation at 23.1 FPS without copying initial frames.

Self-Forcing++: Towards Minute-Scale High-Quality Video Generation

cs.CV · 2025-10-02 · conditional · novelty 6.0

Self-Forcing++ scales autoregressive video diffusion to over 4 minutes by using self-generated segments for guidance, reducing error accumulation and outperforming baselines in fidelity and consistency.

LongLive: Real-time Interactive Long Video Generation

cs.CV · 2025-09-26 · conditional · novelty 6.0

LongLive is a causal autoregressive video generator that produces up to 240-second interactive videos at 20.7 FPS on one H100 GPU after 32 GPU-days of fine-tuning from a 1.3B short-clip model.

ReSim: Reliable World Simulation for Autonomous Driving

cs.CV · 2025-06-11 · unverdicted · novelty 6.0

ReSim is a controllable video world model trained on heterogeneous real and simulated driving data that achieves higher fidelity and controllability for both expert and non-expert actions, plus a Video2Reward module for estimating action quality from simulated futures.

We'll Fix it in Post: Improving Text-to-Video Generation with Neuro-Symbolic Feedback

cs.CV · 2025-04-24 · unverdicted · novelty 6.0

NeuS-E is a post-generation refinement method that uses neuro-symbolic analysis of a formal video representation to detect and correct semantic and temporal inconsistencies in text-to-video outputs, improving prompt alignment by nearly 40%.

citing papers explorer

Showing 39 of 39 citing papers.

TeDiO: Temporal Diagonal Optimization for Training-Free Coherent Video Diffusion cs.CV · 2026-05-13 · unverdicted · none · ref 17 · internal anchor
TeDiO regularizes temporal diagonals in diffusion transformer attention maps to produce smoother video motion while keeping per-frame quality intact.
GaitProtector: Impersonation-Driven Gait De-Identification via Training-Free Diffusion Latent Optimization cs.CV · 2026-05-12 · unverdicted · none · ref 15 · internal anchor
GaitProtector optimizes diffusion model latents to impersonate target identities in gait sequences, dropping Rank-1 identification accuracy from 89.6% to 15.0% on CASIA-B while keeping scoliosis diagnostic accuracy at 74.2%.
DCR: Counterfactual Attractor Guidance for Rare Compositional Generation cs.CV · 2026-05-07 · unverdicted · none · ref 14 · internal anchor
DCR uses a counterfactual attractor and projection-based repulsion to suppress default completion bias in diffusion models, improving fidelity for rare compositional prompts while preserving quality.
Efficient Video Diffusion Models: Advancements and Challenges cs.CV · 2026-04-17 · unverdicted · none · ref 291 · internal anchor
A survey that groups efficient video diffusion methods into four paradigms—step distillation, efficient attention, model compression, and cache/trajectory optimization—and outlines open challenges for practical use.
Not All Frames Deserve Full Computation: Accelerating Autoregressive Video Generation via Selective Computation and Predictive Extrapolation cs.CV · 2026-04-03 · conditional · none · ref 17 · internal anchor
SCOPE accelerates autoregressive video diffusion up to 4.73x by using a tri-modal cache-predict-recompute scheduler with Taylor extrapolation and selective active-frame computation while preserving output quality.
ChopGrad: Pixel-Wise Losses for Latent Video Diffusion via Truncated Backpropagation cs.CV · 2026-03-18 · unverdicted · none · ref 22 · internal anchor
ChopGrad truncates backpropagation to local frame windows in video diffusion models, reducing memory from linear in frame count to constant while enabling pixel-wise loss fine-tuning.
FrameDiT: Diffusion Transformer with Matrix Attention for Efficient Video Generation cs.CV · 2026-03-10 · unverdicted · none · ref 14 · internal anchor
FrameDiT proposes Matrix Attention for DiTs to achieve SOTA video generation with improved temporal coherence and efficiency comparable to local factorized attention.
EduVQA: Towards Concept-Aware Assessment of Educational AI-Generated Videos cs.CV · 2026-03-03 · unverdicted · none · ref 4 · internal anchor
EduVQA introduces the first concept-aware benchmark for educational AI-generated video assessment and a S2D-MoE framework that jointly evaluates perceptual quality and fine-grained semantic alignment.
One-to-All Animation: Alignment-Free Character Animation and Image Pose Transfer cs.CV · 2025-11-28 · unverdicted · none · ref 13 · internal anchor
One-to-All Animation enables alignment-free character animation and image pose transfer via self-supervised outpainting reformulation, reference extraction, hybrid fusion attention, identity-robust pose control, and token replacement for long videos.
History-Guided Video Diffusion cs.LG · 2025-02-10 · unverdicted · none · ref 20 · internal anchor
DFoT enables flexible history conditioning in video diffusion, with history guidance methods that boost temporal consistency and support long rollouts.
Learning Interactive Real-World Simulators cs.AI · 2023-10-09 · conditional · none · ref 197 · internal anchor
UniSim learns a universal real-world simulator from orchestrated diverse datasets, enabling zero-shot deployment of policies trained purely in simulation.
RoboEvolve: Co-Evolving Planner-Simulator for Robotic Manipulation with Limited Data cs.RO · 2026-05-13 · unverdicted · none · ref 51 · internal anchor
A co-evolutionary VLM-VGM loop on 500 unlabeled images raises planner success by 30 points and simulator success by 48 percent while beating fully supervised baselines.
SWIFT: Prompt-Adaptive Memory for Efficient Interactive Long Video Generation cs.CV · 2026-05-10 · unverdicted · none · ref 8 · internal anchor
SWIFT introduces a semantic injection cache with head-wise updates and an adaptive dynamic window plus segment anchors to achieve efficient multi-prompt long video generation at 22.6 FPS while preserving quality in causal diffusion models.
DiffATS: Diffusion in Aligned Tensor Space cs.LG · 2026-05-10 · unverdicted · none · ref 16 · internal anchor
DiffATS trains diffusion models directly on aligned Tucker tensor primitives that are proven to be homeomorphisms, delivering efficient unconditional and conditional generation across images, videos, and PDE data with high compression.
Mutual Forcing: Dual-Mode Self-Evolution for Fast Autoregressive Audio-Video Character Generation cs.CV · 2026-04-28 · unverdicted · none · ref 19 · internal anchor
Mutual Forcing trains a single native autoregressive audio-video model with mutually reinforcing few-step and multi-step modes via self-distillation to match 50-step baselines at 4-8 steps.
Ego-InBetween: Generating Object State Transitions in Ego-Centric Videos cs.CV · 2026-04-20 · unverdicted · none · ref 14 · internal anchor
EgoIn uses a fine-tuned vision-language model to infer transition steps and a conditioning module plus auxiliary supervision to generate coherent egocentric video sequences of object state changes.
Rein3D: Reinforced 3D Indoor Scene Generation with Panoramic Video Diffusion Models cs.CV · 2026-04-12 · unverdicted · none · ref 12 · internal anchor
Rein3D generates photorealistic, globally consistent 3D indoor scenes by using a restore-and-refine process where radial panoramic videos are restored via diffusion models and then used to update a 3D Gaussian field.
Latent-Compressed Variational Autoencoder for Video Diffusion Models cs.CV · 2026-04-12 · unverdicted · none · ref 18 · internal anchor
A frequency-based latent compression method for video VAEs yields higher reconstruction quality than channel-reduction baselines at fixed compression ratios.
Causal Forcing: Autoregressive Diffusion Distillation Done Right for High-Quality Real-Time Interactive Video Generation cs.CV · 2026-02-02 · conditional · none · ref 12 · 2 links · internal anchor
Causal Forcing uses an autoregressive teacher for ODE initialization in diffusion distillation to close the causal attention gap and deliver better real-time video generation than Self Forcing.
Reward Forcing: Efficient Streaming Video Generation with Rewarded Distribution Matching Distillation cs.CV · 2025-12-04 · conditional · none · ref 23 · internal anchor
Reward Forcing combines EMA-Sink tokens and Rewarded Distribution Matching Distillation to deliver state-of-the-art streaming video generation at 23.1 FPS without copying initial frames.
Self-Forcing++: Towards Minute-Scale High-Quality Video Generation cs.CV · 2025-10-02 · conditional · none · ref 16 · internal anchor
Self-Forcing++ scales autoregressive video diffusion to over 4 minutes by using self-generated segments for guidance, reducing error accumulation and outperforming baselines in fidelity and consistency.
LongLive: Real-time Interactive Long Video Generation cs.CV · 2025-09-26 · conditional · none · ref 13 · internal anchor
LongLive is a causal autoregressive video generator that produces up to 240-second interactive videos at 20.7 FPS on one H100 GPU after 32 GPU-days of fine-tuning from a 1.3B short-clip model.
ReSim: Reliable World Simulation for Autonomous Driving cs.CV · 2025-06-11 · unverdicted · none · ref 106 · internal anchor
ReSim is a controllable video world model trained on heterogeneous real and simulated driving data that achieves higher fidelity and controllability for both expert and non-expert actions, plus a Video2Reward module for estimating action quality from simulated futures.
We'll Fix it in Post: Improving Text-to-Video Generation with Neuro-Symbolic Feedback cs.CV · 2025-04-24 · unverdicted · none · ref 29 · internal anchor
NeuS-E is a post-generation refinement method that uses neuro-symbolic analysis of a formal video representation to detect and correct semantic and temporal inconsistencies in text-to-video outputs, improving prompt alignment by nearly 40%.
VBench-2.0: Advancing Video Generation Benchmark Suite for Intrinsic Faithfulness cs.CV · 2025-03-27 · accept · none · ref 53 · internal anchor
VBench-2.0 is a benchmark suite that automatically evaluates video generative models on five dimensions of intrinsic faithfulness: Human Fidelity, Controllability, Creativity, Physics, and Commonsense using VLMs, LLMs, and anomaly detection methods.
Long-Context Autoregressive Video Modeling with Next-Frame Prediction cs.CV · 2025-03-25 · unverdicted · none · ref 49 · internal anchor
FAR baseline plus asymmetric kernels for long short-term context modeling achieves SOTA short and long video generation in autoregressive setups.
CameraCtrl: Enabling Camera Control for Text-to-Video Generation cs.CV · 2024-04-02 · unverdicted · none · ref 115 · internal anchor
CameraCtrl enables accurate camera pose control in video diffusion models through a trained plug-and-play module and dataset choices emphasizing diverse camera trajectories with matching appearance.
Latte: Latent Diffusion Transformer for Video Generation cs.CV · 2024-01-05 · unverdicted · none · ref 4 · internal anchor
Latte achieves state-of-the-art video generation on FaceForensics, SkyTimelapse, UCF101, and Taichi-HD by using a latent diffusion transformer with four efficient spatial-temporal decomposition variants and best-practice training choices.
VideoPoet: A Large Language Model for Zero-Shot Video Generation cs.CV · 2023-12-21 · unverdicted · none · ref 17 · internal anchor
VideoPoet is a large language model that performs zero-shot video generation with audio from diverse multimodal conditioning signals.
VideoCrafter1: Open Diffusion Models for High-Quality Video Generation cs.CV · 2023-10-30 · unverdicted · none · ref 24 · internal anchor
Open-source text-to-video and image-to-video diffusion models generate high-quality 1024x576 videos, with the I2V variant claimed as the first to strictly preserve reference image content.
DragNUWA: Fine-grained Control in Video Generation by Integrating Text, Image, and Trajectory cs.CV · 2023-08-16 · unverdicted · none · ref 294 · internal anchor
DragNUWA integrates text, image, and trajectory controls into a diffusion video model using a Trajectory Sampler, Multiscale Fusion, and Adaptive Training to enable fine-grained open-domain video generation.
SWoMo: Neuro-Symbolic World Model for Cataract Surgery Simulation cs.CV · 2026-05-15 · conditional · none · ref 12 · 2 links · internal anchor
SWoMo decouples symbolic rule-based motion modeling via scene graphs from visual realism via diffusion models, trained through inverse pairing of real cataract surgery videos reconstructed in the simulator for sim-to-real translation.
DepthPilot: From Controllability to Interpretability in Colonoscopy Video Generation cs.CV · 2026-04-29 · unverdicted · none · ref 9 · internal anchor
DepthPilot generates physically consistent and clinically interpretable colonoscopy videos by injecting depth priors into diffusion models through parameter-efficient fine-tuning and replacing linear denoising weights with adaptive splines.
Not all tokens contribute equally to diffusion learning cs.CV · 2026-04-08 · unverdicted · none · ref 6 · internal anchor
DARE mitigates neglect of important tokens in conditional diffusion models via distribution-rectified guidance and spatial attention alignment.
DriVerse: Navigation World Model for Driving Simulation via Multimodal Trajectory Prompting and Motion Alignment cs.RO · 2025-04-22 · unverdicted · none · ref 24 · internal anchor
DriVerse is a generative model that simulates driving scenes from an image and trajectory using multimodal prompting and motion alignment, achieving better performance on nuScenes and Waymo datasets with minimal training.
Empowering Video Translation using Multimodal Large Language Models cs.CV · 2026-04-13 · unverdicted · none · ref 127 · internal anchor
The paper offers the first focused review of MLLM-based video translation organized by a three-role taxonomy of Semantic Reasoner, Expressive Performer, and Visual Synthesizer, plus open challenges.
Character-Centered Dialogue Generation from Scene-Level Prompts cs.CV · 2025-05-22 · unverdicted · none · ref 24 · internal anchor
A training-free framework generates expressive, character-grounded dialogue and speech from scene prompts using vision-language encoders, LLMs, and a recursive narrative memory bank for cross-scene consistency.
Scene-Action Prompt Fusion for Coherent Text-to-Video Storytelling cs.CV · 2025-03-08 · unverdicted · none · ref 13 · internal anchor
A prompt fusion approach combines bidirectional time-weighted latent blending, dynamics-informed prompt weighting via CLIP, and semantic action representations to produce temporally consistent long videos from text without retraining.
Evolution of Video Generative Foundations cs.CV · 2026-04-07 · unverdicted · none · ref 55 · internal anchor
This survey traces video generation technology from GANs to diffusion models and then to autoregressive and multimodal approaches while analyzing principles, strengths, and future trends.

Latent Video Diffusion Models for High-Fidelity Long Video Generation

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer