TeDiO regularizes temporal diagonals in diffusion transformer attention maps to produce smoother video motion while keeping per-frame quality intact.
hub Canonical reference
Latent Video Diffusion Models for High-Fidelity Long Video Generation
Canonical reference. 92% of citing Pith papers cite this work as background.
abstract
AI-generated content has attracted lots of attention recently, but photo-realistic video synthesis is still challenging. Although many attempts using GANs and autoregressive models have been made in this area, the visual quality and length of generated videos are far from satisfactory. Diffusion models have shown remarkable results recently but require significant computational resources. To address this, we introduce lightweight video diffusion models by leveraging a low-dimensional 3D latent space, significantly outperforming previous pixel-space video diffusion models under a limited computational budget. In addition, we propose hierarchical diffusion in the latent space such that longer videos with more than one thousand frames can be produced. To further overcome the performance degradation issue for long video generation, we propose conditional latent perturbation and unconditional guidance that effectively mitigate the accumulated errors during the extension of video length. Extensive experiments on small domain datasets of different categories suggest that our framework generates more realistic and longer videos than previous strong baselines. We additionally provide an extension to large-scale text-to-video generation to demonstrate the superiority of our work. Our code and models will be made publicly available.
hub tools
citation-role summary
citation-polarity summary
representative citing papers
GaitProtector optimizes diffusion model latents to impersonate target identities in gait sequences, dropping Rank-1 identification accuracy from 89.6% to 15.0% on CASIA-B while keeping scoliosis diagnostic accuracy at 74.2%.
DCR uses a counterfactual attractor and projection-based repulsion to suppress default completion bias in diffusion models, improving fidelity for rare compositional prompts while preserving quality.
A survey that groups efficient video diffusion methods into four paradigms—step distillation, efficient attention, model compression, and cache/trajectory optimization—and outlines open challenges for practical use.
SCOPE accelerates autoregressive video diffusion up to 4.73x by using a tri-modal cache-predict-recompute scheduler with Taylor extrapolation and selective active-frame computation while preserving output quality.
ChopGrad truncates backpropagation to local frame windows in video diffusion models, reducing memory from linear in frame count to constant while enabling pixel-wise loss fine-tuning.
FrameDiT proposes Matrix Attention for DiTs to achieve SOTA video generation with improved temporal coherence and efficiency comparable to local factorized attention.
EduVQA introduces the first concept-aware benchmark for educational AI-generated video assessment and a S2D-MoE framework that jointly evaluates perceptual quality and fine-grained semantic alignment.
One-to-All Animation enables alignment-free character animation and image pose transfer via self-supervised outpainting reformulation, reference extraction, hybrid fusion attention, identity-robust pose control, and token replacement for long videos.
DFoT enables flexible history conditioning in video diffusion, with history guidance methods that boost temporal consistency and support long rollouts.
UniSim learns a universal real-world simulator from orchestrated diverse datasets, enabling zero-shot deployment of policies trained purely in simulation.
A co-evolutionary VLM-VGM loop on 500 unlabeled images raises planner success by 30 points and simulator success by 48 percent while beating fully supervised baselines.
SWIFT introduces a semantic injection cache with head-wise updates and an adaptive dynamic window plus segment anchors to achieve efficient multi-prompt long video generation at 22.6 FPS while preserving quality in causal diffusion models.
DiffATS trains diffusion models directly on aligned Tucker tensor primitives that are proven to be homeomorphisms, delivering efficient unconditional and conditional generation across images, videos, and PDE data with high compression.
Mutual Forcing trains a single native autoregressive audio-video model with mutually reinforcing few-step and multi-step modes via self-distillation to match 50-step baselines at 4-8 steps.
EgoIn uses a fine-tuned vision-language model to infer transition steps and a conditioning module plus auxiliary supervision to generate coherent egocentric video sequences of object state changes.
Rein3D generates photorealistic, globally consistent 3D indoor scenes by using a restore-and-refine process where radial panoramic videos are restored via diffusion models and then used to update a 3D Gaussian field.
A frequency-based latent compression method for video VAEs yields higher reconstruction quality than channel-reduction baselines at fixed compression ratios.
Reward Forcing combines EMA-Sink tokens and Rewarded Distribution Matching Distillation to deliver state-of-the-art streaming video generation at 23.1 FPS without copying initial frames.
Self-Forcing++ scales autoregressive video diffusion to over 4 minutes by using self-generated segments for guidance, reducing error accumulation and outperforming baselines in fidelity and consistency.
LongLive is a causal autoregressive video generator that produces up to 240-second interactive videos at 20.7 FPS on one H100 GPU after 32 GPU-days of fine-tuning from a 1.3B short-clip model.
ReSim is a controllable video world model trained on heterogeneous real and simulated driving data that achieves higher fidelity and controllability for both expert and non-expert actions, plus a Video2Reward module for estimating action quality from simulated futures.
NeuS-E is a post-generation refinement method that uses neuro-symbolic analysis of a formal video representation to detect and correct semantic and temporal inconsistencies in text-to-video outputs, improving prompt alignment by nearly 40%.
VBench-2.0 is a benchmark suite that automatically evaluates video generative models on five dimensions of intrinsic faithfulness: Human Fidelity, Controllability, Creativity, Physics, and Commonsense using VLMs, LLMs, and anomaly detection methods.