super hub Canonical reference

HunyuanVideo: A Systematic Framework For Large Video Generative Models

Jin Zhou, Qi Tian, Rox Min, Weijie Kong, Zijian Zhang, Zuozhuo Dai · 2024 · cs.CV · arXiv 2412.03603

Canonical reference. 85% of citing Pith papers cite this work as background.

340 Pith papers citing it

Background 85% of classified citations

open full Pith review browse 340 citing papers more from Jin Zhou arXiv PDF

abstract

Recent advancements in video generation have significantly impacted daily life for both individuals and industries. However, the leading video generation models remain closed-source, resulting in a notable performance gap between industry capabilities and those available to the public. In this report, we introduce HunyuanVideo, an innovative open-source video foundation model that demonstrates performance in video generation comparable to, or even surpassing, that of leading closed-source models. HunyuanVideo encompasses a comprehensive framework that integrates several key elements, including data curation, advanced architectural design, progressive model scaling and training, and an efficient infrastructure tailored for large-scale model training and inference. As a result, we successfully trained a video generative model with over 13 billion parameters, making it the largest among all open-source models. We conducted extensive experiments and implemented a series of targeted designs to ensure high visual quality, motion dynamics, text-video alignment, and advanced filming techniques. According to evaluations by professionals, HunyuanVideo outperforms previous state-of-the-art models, including Runway Gen-3, Luma 1.6, and three top-performing Chinese video generative models. By releasing the code for the foundation model and its applications, we aim to bridge the gap between closed-source and open-source communities. This initiative will empower individuals within the community to experiment with their ideas, fostering a more dynamic and vibrant video generation ecosystem. The code is publicly available at https://github.com/Tencent/HunyuanVideo.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 73 baseline 6 method 3 dataset 2 other 1

citation-polarity summary

background 72 baseline 6 use method 3 unclear 2 use dataset 2

claims ledger

abstract Recent advancements in video generation have significantly impacted daily life for both individuals and industries. However, the leading video generation models remain closed-source, resulting in a notable performance gap between industry capabilities and those available to the public. In this report, we introduce HunyuanVideo, an innovative open-source video foundation model that demonstrates performance in video generation comparable to, or even surpassing, that of leading closed-source models. HunyuanVideo encompasses a comprehensive framework that integrates several key elements, including

authors

Jin Zhou Qi Tian Rox Min Weijie Kong Zijian Zhang Zuozhuo Dai

co-cited works

representative citing papers

Lip Forcing: Few-Step Autoregressive Diffusion for Real-time Lip Synchronization

cs.CV · 2026-06-09 · conditional · novelty 8.0

Lip Forcing distills a 14B bidirectional video diffusion teacher into autoregressive students that achieve real-time lip synchronization at 31 FPS using two denoising steps without CFG.

AnyFlow: Any-Step Video Diffusion Model with On-Policy Flow Map Distillation

cs.CV · 2026-05-13 · unverdicted · novelty 8.0

AnyFlow enables any-step video diffusion by distilling flow-map transitions over arbitrary time intervals with on-policy backward simulation.

TrackCraft3R: Repurposing Video Diffusion Transformers for Dense 3D Tracking

cs.CV · 2026-05-12 · unverdicted · novelty 8.0

TrackCraft3R is the first method to repurpose a video diffusion transformer as a feed-forward dense 3D tracker via dual-latent representations and temporal RoPE alignment, achieving SOTA performance with lower compute.

Flow-GRPO: Training Flow Matching Models via Online RL

cs.CV · 2025-05-08 · unverdicted · novelty 8.0

Flow-GRPO is the first online RL method for flow matching models, raising GenEval accuracy from 63% to 95% and text-rendering accuracy from 59% to 92% with little reward hacking.

No Place to Hide: Benchmarking Video Hallucination with Background-Controlled Pairs

cs.CV · 2026-06-30 · unverdicted · novelty 7.0

Introduces VidPair-Halluc benchmark of 1K background-controlled adversarial video pairs and 11K QA pairs generated via PairFlow pipeline to evaluate hallucination in LVMs.

MemLearner: Learning to Query Context memory for Video World Models

cs.CV · 2026-06-30 · unverdicted · novelty 7.0

MemLearner introduces a learning-based adaptive context query method using query tokens in video world models to improve long-term scene consistency over rule-based retrieval.

OmniDance: Multimodal Driven Dance Video Generation with Large-scale Internet Data

cs.CV · 2026-06-29 · unverdicted · novelty 7.0

Introduces CIPE-Dance as the largest dance video dataset and OmniDance framework for unified text-music multimodal dance video generation achieving SOTA on TI2V, MI2V, and MTI2V tasks.

Unleashing Infinite Motion: Scaling Expressive Quadrupedal Motion via Generative Video Priors

cs.RO · 2026-06-26 · conditional · novelty 7.0

Uni-Mo generates 7,488 language-annotated quadruped motions via LLM prompts and video diffusion, lifts them to 3D trajectories, and trains policies achieving 96.7% real-robot success on 392 sampled motions.

RayPE: Ray-Space Positional Encoding for 3D-Aware Video Generation

cs.CV · 2026-06-25 · unverdicted · novelty 7.0

RayPE extends video DiT attention with Plucker coordinates and a gated reciprocal-product term to improve 3D consistency and camera controllability.

FadeMem: Distance-Aware Memory Consolidation for Autoregressive Video Diffusion

cs.CV · 2026-06-09 · unverdicted · novelty 7.0

FadeMem introduces distance-aware KV memory consolidation for autoregressive video diffusion that builds a temporal hierarchy with power-law merging to preserve short-term dynamics and long-range coherence under fixed cache budget.

OmniTryOn: Video Try-On Anything at Once!

cs.CV · 2026-06-07 · unverdicted · novelty 7.0

OmniTryOn performs multi-object video virtual try-on in one pass using first-frame wearable caching and spatiotemporal RoPE, outperforming single-garment baselines on a new TryAny-Bench dataset.

Dream.exe: Can Video Generation Models Dream Executable Robot Manipulation?

cs.CV · 2026-06-03 · unverdicted · novelty 7.0

Dream.exe evaluates 8 video generation models on 101 manipulation tasks by converting generated videos into executable robot trajectories in a simulator, finding measurable success rates that visual metrics do not predict.

Activation Steering of Video Generation Models via Reduced-Order Linear Optimal Control

cs.LG · 2026-06-03 · unverdicted · novelty 7.0

LA-LQR applies latent-space linear-quadratic regulator control to steer text-to-video model activations toward desired features while penalizing excessive changes.

Ultra-Fast Neural Video Compression

cs.CV · 2026-06-03 · unverdicted · novelty 7.0

DCVC-UF uses chunk-based joint encoding and parallel frame-specific decoding to deliver ultra-fast neural video compression while claiming new state-of-the-art rate-distortion performance.

Diffusing in the Right Space: A Systematic Study of Latent Diffusability

cs.CV · 2026-06-02 · unverdicted · novelty 7.0

A large-scale empirical study across tokenizers and diffusion backbones identifies Velocity Irreducible Variance (VIV) as one of the most stable predictors of latent diffusion generation quality.

From Zero to Hero: Training-Free Custom Concept Spawning in World Models

cs.CV · 2026-06-01 · unverdicted · novelty 7.0

SPAWN enables training-free insertion of custom visual concepts into autoregressive world models by swapping the pinned context-memory anchor over a short injection window.

VLMs are Good Teachers for Video Reasoning via Adaptive Test-Time Optimization

cs.CV · 2026-06-01 · unverdicted · novelty 7.0 · 2 refs

VLMs formulate differentiable rewards from task-specific rules to enable test-time online LoRA optimization of VGMs, delivering 16.7-point gains on symbolic and general video reasoning benchmarks over VLM-as-solver and Best-of-N baselines.

LongLive-RAG: A General Retrieval-Augmented Framework for Long Video Generation

cs.CV · 2026-06-01 · unverdicted · novelty 7.0

LongLive-RAG formulates long video generation as retrieval-augmented generation by treating self-generated latents as a dynamic searchable history and adding a Window Temporal Delta Loss for better retrieval.

MBench: A Comprehensive Benchmark on Memory Capability for Video World Models

cs.CV · 2026-05-30 · unverdicted · novelty 7.0

MBench is a new benchmark that quantifies long-term memory in video world models via three hierarchical consistency dimensions evaluated on curated real videos.

YoCausal: How Far is Video Generation from World Model? A Causality Perspective

cs.CV · 2026-05-28 · unverdicted · novelty 7.0

YoCausal benchmark shows video diffusion models detect the arrow of time but lack genuine causal understanding relative to humans.

Future Forcing: Future-aware Training-free KV Cache Policy for Autoregressive Video Generation

cs.CV · 2026-05-28 · unverdicted · novelty 7.0

Future Forcing constructs a future query proxy from historical pre-RoPE statistics to score and merge KV tokens, improving subject consistency by up to 1.49 on VBench-Long for 60s AR video generation.

Paris 2.0: A Decentralized Diffusion Model for Video Generation

cs.CV · 2026-05-25 · unverdicted · novelty 7.0

Paris 2.0 is the first decentralized diffusion model for text-to-video generation and reports roughly 2x lower FVD than a monolithic baseline under matched total compute.

WBench: A Comprehensive Multi-turn Benchmark for Interactive Video World Model Evaluation

cs.CV · 2026-05-25 · unverdicted · novelty 7.0

WBench is a benchmark with 289 test cases and 1,058 turns for evaluating interactive world models using 22 automated metrics validated against human judgments.

Geo-Align: Video Generation Alignment via Metric Geometry Reward

cs.CV · 2026-05-22 · unverdicted · novelty 7.0

Geo-Align applies RL with a perceptual reward derived from 3D camera trajectory estimation to improve controllability and fidelity in video generation without paired training data.

citing papers explorer

Showing 4 of 4 citing papers after filters.

Cutscene Agent: An LLM Agent Framework for Automated 3D Cutscene Generation cs.GR · 2026-04-28 · unverdicted · none · ref 10 · internal anchor
Cutscene Agent uses a multi-agent LLM system and a new toolkit for game engine control to automate end-to-end 3D cutscene generation, evaluated on the introduced CutsceneBench.
MoZoo:Unleashing Video Diffusion power in animal fur and muscle simulation cs.GR · 2026-04-08 · unverdicted · none · ref 19 · internal anchor
MoZoo generates high-fidelity animal videos with fur and muscle dynamics from coarse meshes by extending video diffusion with role-aware RoPE and asymmetric decoupled attention, trained on a new synthetic-to-real dataset.
AlbedoEdit: Unified Instance-Level Video Editing with Albedo Guidance cs.GR · 2026-05-31 · unverdicted · none · ref 23 · internal anchor
AlbedoEdit fine-tunes video foundation models to translate RGB videos into edited versions conditioned on user-edited first-frame albedo maps, trained on a new synthetic paired dataset for insertion, removal, and texture tasks.
SURF: Signature-Retained Fast Video Generation cs.GR · 2025-11-25 · unverdicted · none · ref 18 · internal anchor
SURF accelerates high-resolution video generation up to 12.5x by using noise reshifting for low-res previews from pretrained models and a shifting-window Refiner for efficient upscaling that retains original signatures.

HunyuanVideo: A Systematic Framework For Large Video Generative Models

hub tools

citation-role summary

citation-polarity summary

claims ledger

authors

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer