HunyuanVideo: A Systematic Framework For Large Video Generative Models

Jin Zhou, Qi Tian, Rox Min, Weijie Kong, Zijian Zhang, Zuozhuo Dai · 2024 · cs.CV · arXiv 2412.03603

Canonical reference. 85% of citing Pith papers cite this work as background.

343 Pith papers citing it

Background: 85% of classified citations

abstract

Recent advancements in video generation have significantly impacted daily life for both individuals and industries. However, the leading video generation models remain closed-source, resulting in a notable performance gap between industry capabilities and those available to the public. In this report, we introduce HunyuanVideo, an innovative open-source video foundation model that demonstrates performance in video generation comparable to, or even surpassing, that of leading closed-source models. HunyuanVideo encompasses a comprehensive framework that integrates several key elements, including data curation, advanced architectural design, progressive model scaling and training, and an efficient infrastructure tailored for large-scale model training and inference. As a result, we successfully trained a video generative model with over 13 billion parameters, making it the largest among all open-source models. We conducted extensive experiments and implemented a series of targeted designs to ensure high visual quality, motion dynamics, text-video alignment, and advanced filming techniques. According to evaluations by professionals, HunyuanVideo outperforms previous state-of-the-art models, including Runway Gen-3, Luma 1.6, and three top-performing Chinese video generative models. By releasing the code for the foundation model and its applications, we aim to bridge the gap between closed-source and open-source communities. This initiative will empower individuals within the community to experiment with their ideas, fostering a more dynamic and vibrant video generation ecosystem. The code is publicly available at https://github.com/Tencent/HunyuanVideo.

citation-role summary

background 73 · baseline 6 · method 3 · dataset 2 · other 1

citation-polarity summary

background 72 · baseline 6 · use method 3 · unclear 2 · use dataset 2

representative citing papers

Lip Forcing: Few-Step Autoregressive Diffusion for Real-time Lip Synchronization

cs.CV · 2026-06-09 · conditional · novelty 8.0

Lip Forcing distills a 14B bidirectional video diffusion teacher into autoregressive students that achieve real-time lip synchronization at 31 FPS using two denoising steps without CFG.

AnyFlow: Any-Step Video Diffusion Model with On-Policy Flow Map Distillation

cs.CV · 2026-05-13 · unverdicted · novelty 8.0

AnyFlow enables any-step video diffusion by distilling flow-map transitions over arbitrary time intervals with on-policy backward simulation.

TrackCraft3R: Repurposing Video Diffusion Transformers for Dense 3D Tracking

cs.CV · 2026-05-12 · unverdicted · novelty 8.0

TrackCraft3R is the first method to repurpose a video diffusion transformer as a feed-forward dense 3D tracker via dual-latent representations and temporal RoPE alignment, achieving SOTA performance with lower compute.

Flow-GRPO: Training Flow Matching Models via Online RL

cs.CV · 2025-05-08 · unverdicted · novelty 8.0

Flow-GRPO is the first online RL method for flow matching models, raising GenEval accuracy from 63% to 95% and text-rendering accuracy from 59% to 92% with little reward hacking.

No Place to Hide: Benchmarking Video Hallucination with Background-Controlled Pairs

cs.CV · 2026-06-30 · unverdicted · novelty 7.0

Introduces VidPair-Halluc benchmark of 1K background-controlled adversarial video pairs and 11K QA pairs generated via PairFlow pipeline to evaluate hallucination in LVMs.

MemLearner: Learning to Query Context memory for Video World Models

cs.CV · 2026-06-30 · unverdicted · novelty 7.0

MemLearner introduces a learning-based adaptive context query method using query tokens in video world models to improve long-term scene consistency over rule-based retrieval.

OmniDance: Multimodal Driven Dance Video Generation with Large-scale Internet Data

cs.CV · 2026-06-29 · unverdicted · novelty 7.0

Introduces CIPE-Dance as the largest dance video dataset and OmniDance framework for unified text-music multimodal dance video generation achieving SOTA on TI2V, MI2V, and MTI2V tasks.

Unleashing Infinite Motion: Scaling Expressive Quadrupedal Motion via Generative Video Priors

cs.RO · 2026-06-26 · conditional · novelty 7.0

Uni-Mo generates 7,488 language-annotated quadruped motions via LLM prompts and video diffusion, lifts them to 3D trajectories, and trains policies achieving 96.7% real-robot success on 392 sampled motions.

RayPE: Ray-Space Positional Encoding for 3D-Aware Video Generation

cs.CV · 2026-06-25 · unverdicted · novelty 7.0

RayPE extends video DiT attention with Plucker coordinates and a gated reciprocal-product term to improve 3D consistency and camera controllability.

World Model Self-Distillation: Training World Models to Solve General Tasks

cs.CV · 2026-06-10 · unverdicted · novelty 7.0

Self-distillation from a caption-conditioned video diffusion model to an image-and-prompt-conditioned executor, enhanced by RL from VLM feedback, enables task solving in world models.

FadeMem: Distance-Aware Memory Consolidation for Autoregressive Video Diffusion

cs.CV · 2026-06-09 · unverdicted · novelty 7.0

FadeMem introduces distance-aware KV memory consolidation for autoregressive video diffusion that builds a temporal hierarchy with power-law merging to preserve short-term dynamics and long-range coherence under fixed cache budget.

OmniTryOn: Video Try-On Anything at Once!

cs.CV · 2026-06-07 · unverdicted · novelty 7.0

OmniTryOn performs multi-object video virtual try-on in one pass using first-frame wearable caching and spatiotemporal RoPE, outperforming single-garment baselines on a new TryAny-Bench dataset.

Dream.exe: Can Video Generation Models Dream Executable Robot Manipulation?

cs.CV · 2026-06-03 · unverdicted · novelty 7.0

Dream.exe evaluates 8 video generation models on 101 manipulation tasks by converting generated videos into executable robot trajectories in a simulator, finding measurable success rates that visual metrics do not predict.

Activation Steering of Video Generation Models via Reduced-Order Linear Optimal Control

cs.LG · 2026-06-03 · unverdicted · novelty 7.0

LA-LQR applies latent-space linear-quadratic regulator control to steer text-to-video model activations toward desired features while penalizing excessive changes.

Ultra-Fast Neural Video Compression

cs.CV · 2026-06-03 · unverdicted · novelty 7.0

DCVC-UF uses chunk-based joint encoding and parallel frame-specific decoding to deliver ultra-fast neural video compression while claiming new state-of-the-art rate-distortion performance.

Diffusing in the Right Space: A Systematic Study of Latent Diffusability

cs.CV · 2026-06-02 · unverdicted · novelty 7.0

A large-scale empirical study across tokenizers and diffusion backbones identifies Velocity Irreducible Variance (VIV) as one of the most stable predictors of latent diffusion generation quality.

From Zero to Hero: Training-Free Custom Concept Spawning in World Models

cs.CV · 2026-06-01 · unverdicted · novelty 7.0

SPAWN enables training-free insertion of custom visual concepts into autoregressive world models by swapping the pinned context-memory anchor over a short injection window.

VLMs are Good Teachers for Video Reasoning via Adaptive Test-Time Optimization

cs.CV · 2026-06-01 · unverdicted · novelty 7.0 · 2 refs

VLMs formulate differentiable rewards from task-specific rules to enable test-time online LoRA optimization of VGMs, delivering 16.7-point gains on symbolic and general video reasoning benchmarks over VLM-as-solver and Best-of-N baselines.

LongLive-RAG: A General Retrieval-Augmented Framework for Long Video Generation

cs.CV · 2026-06-01 · unverdicted · novelty 7.0

LongLive-RAG formulates long video generation as retrieval-augmented generation by treating self-generated latents as a dynamic searchable history and adding a Window Temporal Delta Loss for better retrieval.

MBench: A Comprehensive Benchmark on Memory Capability for Video World Models

cs.CV · 2026-05-30 · unverdicted · novelty 7.0

MBench is a new benchmark that quantifies long-term memory in video world models via three hierarchical consistency dimensions evaluated on curated real videos.

YoCausal: How Far is Video Generation from World Model? A Causality Perspective

cs.CV · 2026-05-28 · unverdicted · novelty 7.0

YoCausal benchmark shows video diffusion models detect the arrow of time but lack genuine causal understanding relative to humans.

Future Forcing: Future-aware Training-free KV Cache Policy for Autoregressive Video Generation

cs.CV · 2026-05-28 · unverdicted · novelty 7.0

Future Forcing constructs a future query proxy from historical pre-RoPE statistics to score and merge KV tokens, improving subject consistency by up to 1.49 on VBench-Long for 60s AR video generation.

Paris 2.0: A Decentralized Diffusion Model for Video Generation

cs.CV · 2026-05-25 · unverdicted · novelty 7.0

Paris 2.0 is the first decentralized diffusion model for text-to-video generation and reports roughly 2x lower FVD than a monolithic baseline under matched total compute.

WBench: A Comprehensive Multi-turn Benchmark for Interactive Video World Model Evaluation

cs.CV · 2026-05-25 · unverdicted · novelty 7.0

WBench is a benchmark with 289 test cases and 1,058 turns for evaluating interactive world models using 22 automated metrics validated against human judgments.

citing papers explorer

Showing 50 of 343 citing papers.

SyncCache: Exploiting Asymmetric Dynamics for Fast Audio-Driven Portrait Animation

cs.CV · 2026-06-29 · unverdicted · none · ref 17 · internal anchor

SyncCache accelerates DiT-based audio-driven portrait animation up to 4.12x via spatially-asymmetric probing and modality-decoupled caching while preserving near-lossless quality and audio sync.

AVTok: 1D Unified Tokenization for Holistic Audio-Video Generation

cs.CV · 2026-06-29 · unverdicted · none · ref 29 · internal anchor

AVTok is a unified tokenizer that converts audio-video pairs into a compact 1D latent representation via dual-stream transformer and hierarchical training for improved reconstruction and cross-modal generation.

EcoVideo: Entropy-Orchestrated Video Generation Paradigm in Cloud-Edge Dynamics

cs.CV · 2026-06-29 · unverdicted · none · ref 12 · internal anchor

EcoVideo introduces entropy-driven dynamic frame selection for cloud-edge DiT video generation, yielding up to 2.9x speedup with adaptive keyframe budgets.

Semantic-Aware, Physics-Informed, Geometry-Grounded Weather Video Synthesis

cs.CV · 2026-06-27 · unverdicted · none · ref 35 · internal anchor

A new framework factorizes weather video synthesis into semantic appearance anchoring, physics-informed Gaussian particle simulation under gravity/wind/turbulence, and geometry-grounded alignment to produce diverse realistic weather effects.

EMOSH: Expressive Motion and Shape Disentanglement for Human Animation

cs.CV · 2026-06-26 · unverdicted · none · ref 25 · internal anchor

EMOSH proposes an Expressive Human Model with disentangled parameters, coarse-to-fine motion injection, and spatially-aligned conditioning to generate high-fidelity expressive human videos without driving-subject shape leakage.

SpecLoR: Spectral Lookahead Rectification for Motion-Coherent Text-to-Video Generation

cs.CV · 2026-06-10 · unverdicted · none · ref 1 · internal anchor

SpecLoR rectifies the amplitude spectrum of lookahead-estimated clean latents to natural-video priors during early ODE sampling steps, cutting physical artifacts with only four extra NFEs.

ARGUS: Stacked Multi-View Identity Mosaic Injection for Subject-Preserving Video Generation

cs.CV · 2026-06-10 · unverdicted · none · ref 25 · internal anchor

ARGUS converts MLLM-selected identity evidence into a synchronized 3x3 mosaic injected as negative-time memory in a diffusion model, plus supporting training techniques, to achieve SOTA subject preservation on human video benchmarks.

Latent Spatial Memory for Video World Models

cs.CV · 2026-06-08 · unverdicted · none · ref 3 · internal anchor

Mirage stores and queries 3D scene information in diffusion latent space via depth-guided lifting and warping, yielding 10.57× faster generation and 55× smaller memory than explicit RGB point-cloud baselines while reaching SOTA on WorldScore.

Prisma-World: Camera-Controllable Multi-Agent Video World Model

cs.CV · 2026-06-08 · unverdicted · none · ref 1 · internal anchor

Prisma-World is a diffusion-based multi-agent video model that uses joint full-attention, multi-agent RoPE, and relative camera geometry injection plus curriculum training to produce consistent cross-view videos from flexible agent counts.

LiteVSR: Lightweight Adaptation of Frozen Diffusion Transformers for Video Super-Resolution

cs.CV · 2026-06-08 · unverdicted · none · ref 4 · internal anchor

LiteVSR performs video super-resolution on a completely frozen Diffusion Transformer via a lightweight State-Aware Adapter that uses dual-stream extraction and time-dependent cross-attention, reaching competitive quality with 11.25% trainable parameters after 12 GPU-hours.

OmniGen-AR: AutoRegressive Any-to-Image Generation

cs.CV · 2026-06-08 · unverdicted · none · ref 35 · internal anchor

OmniGen-AR is a unified autoregressive framework for any-to-image generation that tokenizes text and visual conditions together and uses disentangled causal attention to support tasks like text-to-image, depth-to-image, image editing, and text-to-video while reporting 0.63 on GenEval and 80.02 on VB

DisCo: World Models with Discrete Camera Motion Control

cs.CV · 2026-06-06 · unverdicted · none · ref 24 · internal anchor

DisCo uses discrete action primitives for camera control in video world models to achieve more reliable action following than continuous trajectories.

Streaming Video Generation with Streaming Force Control

cs.CV · 2026-06-05 · unverdicted · none · ref 30 · internal anchor

StreamForce presents a unified causal model for force-controllable streaming video generation using a new force representation and distillation pipeline, claiming SOTA force adherence and 16.6 FPS performance.

RhymeFlow: Training-Free Acceleration for Video Generation with Asynchronous Denoising Flow Scheduling

cs.CV · 2026-06-04 · unverdicted · none · ref 18 · internal anchor

RhymeFlow is a training-free acceleration framework that decouples denoising trajectories across video frames by dense processing of semantic keyframes and asynchronous skipping for non-keyframes, augmented by a latent trajectory projection module to maintain consistency.

ReCache: Learning Budget-Aware Caching Schedules for Diffusion Models via REINFORCE

cs.CV · 2026-06-04 · unverdicted · none · ref 28 · internal anchor

ReCache learns recomputation schedules via policy gradients to maximize quality under a target compute budget for any caching mechanism in diffusion models.

Echo-Infinity: Learning Evolving Memory for Real-Time Infinite Video Generation

cs.MM · 2026-06-03 · unverdicted · none · ref 26 · internal anchor

Echo-Infinity replaces handcrafted KV-cache schedules with end-to-end optimized Memory Queries and a Unified Relative RoPE recipe to support real-time infinite video generation in diffusion transformers.

AAD-1: Asymmetric Adversarial Distillation for One-Step Autoregressive Video Generation

cs.CV · 2026-06-02 · unverdicted · none · ref 8 · internal anchor

AAD-1 uses a causal generator with a bidirectional holistic discriminator plus phased distribution matching before adversarial training to reach state-of-the-art one-step autoregressive video generation on VBench.

Video-Mirai: Autoregressive Video Diffusion Models Need Foresight

cs.CV · 2026-06-02 · unverdicted · none · ref 20 · internal anchor

Training method distills non-causal future targets into causal video diffusion states to boost long-horizon consistency without changing inference architecture or cost.

Training-Free Multi-Concept LoRA Composition with Prompt-Aware Weighting

cs.CV · 2026-06-02 · unverdicted · none · ref 26 · internal anchor

Prompt-aware weighting strategies W-Switch and W-Composite improve multi-concept LoRA composition in diffusion models without training.

MetaWorld: Scaling Multi-Agent Video World Model from Single-view Video Data

cs.CV · 2026-06-01 · unverdicted · none · ref 17 · internal anchor

MetaWorld scales multi-agent video world models from single-view videos using monocular decomposition into ego-motion and trajectories, subject-aware generation, and cross-attention alignment for consistency.

Retrieve What's Missing: Coverage-Maximizing Retrieval for Consistent Long Video Generation

cs.CV · 2026-06-01 · unverdicted · none · ref 27 · internal anchor

COVRAG improves long-horizon geometric consistency in autoregressive video generation via coverage-maximizing retrieval on lightweight depth-based 3D memory evidence.

Auteur: Language-Driven Cinematographic Framing for Human-Centric Video Generation

cs.CV · 2026-06-01 · unverdicted · none · ref 9 · internal anchor

Auteur formalizes human-centric camera framing as a DSL, uses a fine-tuned MLLM to map text and motion to DSL keyframes, and interpolates them into trajectories for video generators.

PAI-Studio: Cinematic Video Background Replacement with Camera-Aware Motion

cs.CV · 2026-05-31 · unverdicted · none · ref 22 · internal anchor

PAI-Studio reformulates cinematic background replacement as in-context conditional generation inside a Diffusion Transformer with bidirectional attention, trained on a new 30K film-sourced dataset, and reports better motion consistency and relighting than prior open-source and commercial systems.

AlbedoEdit: Unified Instance-Level Video Editing with Albedo Guidance

cs.GR · 2026-05-31 · unverdicted · none · ref 23 · internal anchor

AlbedoEdit fine-tunes video foundation models to translate RGB videos into edited versions conditioned on user-edited first-frame albedo maps, trained on a new synthetic paired dataset for insertion, removal, and texture tasks.

Lumos-Nexus: Efficient Frequency Bridging with Homogeneous Latent Space for Video Unified Models

cs.CV · 2026-05-29 · unverdicted · none · ref 22 · internal anchor

Lumos-Nexus is a training-efficient video generation framework using two-stage alignment of a lightweight model followed by progressive frequency bridging to a high-fidelity generator in homogeneous latent space, plus the new VR-Bench for reasoning evaluation.

LVSA: Training-Free Sparse Attention for Long Video Diffusion

cs.CV · 2026-05-29 · unverdicted · none · ref 3 · internal anchor

LVSA is a training-free block-sparse attention technique combining structured windows with rotating global anchors that reduces inference compute 2.98-3.33x on video diffusion models at extended horizons while remaining quality-neutral or positive.

CameraNoise: Enabling Faithful Camera Control in Video Diffusion through Geometry-Flow-Guided Noise Warping

cs.CV · 2026-05-29 · unverdicted · none · ref 17 · internal anchor

CameraNoise embeds camera motion into the noise space of video diffusion via Geometry-guided Reprojection Flow and noise warping to achieve faithful trajectory control while preserving the diffusion prior.

OmniMem: Scalable and Adaptive Memory Retrieval for Long Video Generation

cs.CV · 2026-05-28 · unverdicted · none · ref 6 · internal anchor

OmniMem enables scalable long video generation via adaptive sparse KV retrieval that addresses local bias and union explosion while preserving explicit historical access.

VPG: Visual Prefix Guidance for Autoregressive Image and Video Generation

cs.CV · 2026-05-28 · unverdicted · none · ref 22 · internal anchor

VPG is a training-free inference-time guidance technique that improves autoregressive image and video generation by contrasting model outputs under generated versus corrupted prefixes to strengthen next-step support for the prefix.

minWM: A Full-Stack Open-Source Framework for Real-Time Interactive Video World Models

cs.CV · 2026-05-28 · unverdicted · none · ref 7 · internal anchor

minWM supplies an end-to-end pipeline that fine-tunes bidirectional T2V/TI2V models with camera control then distills them via Causal Forcing into few-step autoregressive generators for low-latency rollout.

Gamma-World: Generative Multi-Agent World Modeling Beyond Two Players

cs.CV · 2026-05-27 · unverdicted · none · ref 28 · internal anchor

A multi-agent video world model using simplex rotary agent encoding and sparse hub attention achieves better fidelity, controllability, and consistency than baselines while generalizing from 2 to 4 players.

Refining Multidimensional Video Reward Models via Disentangled Influence Functions

cs.LG · 2026-05-27 · unverdicted · none · ref 3 · internal anchor

Introduces dimension-disentangled influence estimation to prune or reweight training samples for MVRMs, outperforming global scalar filtering in alignment with ground truth.

SmartDirector: Keyframe-Conditioned Cinematic Video Generation with Narrative Pacing Control

cs.CV · 2026-05-27 · unverdicted · none · ref 2 · internal anchor

SmartDirector generates cinematic videos via Director-Gen for low-res keyframe-conditioned output followed by Director-SR refinement using high-res keyframes, trained on curated movie sequences.

PARE: Pruning and Adaptive Routing for Efficient Video Generation

cs.CV · 2026-05-26 · unverdicted · none · ref 15 · internal anchor

PARE applies structure-aware head pruning and timestep/content-conditioned block routing to compress video DiTs, reducing per-step compute while preserving quality on Wan2.1-14B.

Adversarial Dual On-Policy Distillation from Expressive Teacher

cs.LG · 2026-05-26 · unverdicted · none · ref 7 · internal anchor

FA-OPD co-trains a flow-matching teacher and MLP student via adversarial dual on-policy distillation, improving robustness over baselines on six robot benchmarks with noisy or limited demonstrations.

CoCoVideo: The High-Quality Commercial-Model-Based Contrastive Benchmark for AI-Generated Video Detection

cs.CV · 2026-05-26 · unverdicted · none · ref 43 · internal anchor

Introduces a commercial-model contrastive AIGC video dataset and a hybrid contrastive-MLLM detection framework claiming SOTA performance on realistic video forgery detection.

DexSIM: Real-time Dexterous Simulation with Unified Causal Video Diffusion

cs.CV · 2026-05-23 · unverdicted · none · ref 8 · internal anchor

DexSIM is a bi-directional video diffusion model with hand trajectory embedding and spatial memory cache for real-time dexterous hand-object simulation at 15 FPS.

LaMo: Self-Supervised Latent Motion Priors for Physical Realism in Video Generation

cs.CV · 2026-05-22 · unverdicted · none · ref 12 · internal anchor

LaMo adds self-supervised latent motion priors via a motion drift loss during training and motion prior guidance during sampling to boost physical fidelity in video diffusion models like CogVideoX.

SCOPE: Simulating Cross-game Operations in Playable Environments for FPS World Models

cs.CV · 2026-05-22 · unverdicted · none · ref 29 · internal anchor

SCOPE adds per-pixel action conditioning to pretrained video diffusion models and releases the CrossFPS multi-game dataset to support cross-game FPS world model simulation with zero-shot transfer.

EvalVerse: Pipeline-Aware and Expert-Calibrated Benchmarking for Professional Cinematic Video Generation

cs.CV · 2026-05-22 · unverdicted · none · ref 12 · internal anchor

EvalVerse is a pipeline-aware benchmark that distills expert cinematic judgments into VLMs to assess 'goodness' metrics like aesthetics and multi-shot coherence alongside basic prompt adherence.

FlowLong: Inference-time Long Video Generation via Manifold-constrained Tweedie Matching

cs.CV · 2026-05-20 · unverdicted · none · ref 17 · internal anchor

FlowLong generates videos several times longer than native model windows by blending adjacent predictions with Tweedie matching to enforce manifold and temporal consistency while using stochastic noise injection early and deterministic sampling later.

What Semantics Survive the Connector? Diagnosing VLM-to-DiT Alignment in Video Editing

cs.CV · 2026-05-20 · unverdicted · none · ref 10 · 2 links · internal anchor

Introduces TRACE-Edit dataset and evaluation protocol demonstrating semantic degradation of structural variables during VLM-to-DiT alignment in flow-matching video editors.

Rethinking Cross-Layer Information Routing in Diffusion Transformers

cs.CV · 2026-05-20 · unverdicted · none · ref 25 · 2 links · internal anchor

DAR replaces residual addition in DiTs with learnable, timestep-adaptive aggregation of sublayer outputs, yielding 2.11 FID improvement on SiT-XL/2 and 8.75x faster convergence on ImageNet 256x256.

Accelerating Video Inverse Problem Solvers with Autoregressive Diffusion Models

cs.CV · 2026-05-20 · unverdicted · none · ref 16 · internal anchor

AVIS applies autoregressive diffusion models to video inverse problems by streaming restoration with measurement-consistent initialization, reducing latency from 114s to 4s and raising throughput to 1.18 FPS (or 5.91 FPS in the Flash variant).

World-Ego Modeling for Long-Horizon Evolution in Hybrid Embodied Tasks

cs.CV · 2026-05-19 · unverdicted · none · ref 51 · internal anchor

Proposes World-Ego Modeling with WEM using CP-MoE diffusion and a new HTEWorld benchmark, claiming SOTA on hybrid navigation-manipulation tasks.

TextAlign: Preference Alignment for Text Rendering with Hierarchical Rewards

cs.CV · 2026-05-19 · unverdicted · none · ref 13 · 2 links · internal anchor

TextAlign uses a hierarchical VLM reward for preference alignment to boost text accuracy in generative models like FLUX.1-dev.

NEWTON: Agentic Planning for Physically Grounded Video Generation

cs.CV · 2026-05-18 · unverdicted · none · ref 17 · internal anchor

NEWTON improves physical accuracy in video generation by deploying a trainable planner that coordinates physics-aware tools and a verifier, raising joint accuracy on VideoPhy-2 without altering the base generators.

GeoFlow: Enforcing Implicit Geometric Consistency in Video Generation

cs.CV · 2026-05-18 · unverdicted · none · ref 33 · internal anchor

GeoFlow adds a geometry-consistency reward based on rigid camera flow and object appearance preservation, integrated via reinforcement fine-tuning to improve geometric coherence in video generation.

Enhancing Train-Free Infinite-Frame Generation for Consistent Long Videos

cs.CV · 2026-05-18 · unverdicted · none · ref 15 · internal anchor

MIGA introduces two-stage alignment to close train-inference gaps and dual consistency enhancement via self-reflection and long-range guidance to achieve SOTA temporal consistency in infinite-frame video generation on VBench and NarrLV.

CAB: Accelerating Flow and Diffusion Sampling via Rectification and Corrected Adams-Bashforth

cs.CV · 2026-05-16 · conditional · none · ref 42 · 2 links · internal anchor

CAB accelerates flow and diffusion sampling via rectification to a common coordinate system followed by a corrected Adams-Bashforth multistep method that achieves third-order local truncation error while improving quality at low NFEs.
