EverAnimate: Minute-Scale Human Animation via Latent Flow Restoration

Alexandre Alahi; Lan Feng; Mariam Hassan; Po-Chien Luan; Wentao Pan; Wuyang Li; Yang Gao

arxiv: 2605.15042 · v1 · pith:D722KMCPnew · submitted 2026-05-14 · 💻 cs.CV · cs.AI

EverAnimate: Minute-Scale Human Animation via Latent Flow Restoration

Wuyang Li , Yang Gao , Mariam Hassan , Lan Feng , Wentao Pan , Po-Chien Luan , Alexandre Alahi This is my paper

Pith reviewed 2026-06-30 21:19 UTC · model grok-4.3

classification 💻 cs.CV cs.AI

keywords video generationhuman animationlong-form videolatent diffusiondrift mitigationflow matchingLoRA adaptation

0 comments

The pith

EverAnimate generates minute-scale human animations without accumulating quality or identity drift by restoring latent flow trajectories across chunks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents EverAnimate as a lightweight post-training approach for producing extended animated videos while keeping visual quality and character consistency intact. Chunk-wise generation of dynamic human motion against static scenes tends to cause progressive degradation in backgrounds and loss of identity over time. The method counters this by anchoring each new chunk to a running latent context memory that carries forward identity and motion information, combined with an adjustment to the sampling velocity that implicitly restores fidelity inside each chunk. Only small LoRA updates are required on top of an existing model. Reported metrics show the advantage growing with duration, reaching double-digit gains in standard image-quality measures at 90-second lengths.

Core claim

EverAnimate restores drifted flow trajectories by anchoring generation to a persistent latent context memory, consisting of Persistent Latent Propagation that maintains identity and motion across chunks and Restorative Flow Matching that applies velocity adjustment during sampling to improve within-chunk fidelity.

What carries the argument

Persistent latent context memory that propagates identity and motion across generation chunks while Restorative Flow Matching adjusts sampling velocities to restore fidelity.

If this is right

Static background elements remain stable for the full duration of the animation.
Character appearance and viewpoint attributes stay consistent without additional identity-preserving losses.
The same base model can be adapted to long sequences with only a few thousand training steps.
Gains in perceptual metrics widen as sequence length increases from 10 to 90 seconds.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same latent-memory anchoring could be tested on non-human subjects or scene-only videos to check generality.
Combining the approach with existing temporal attention layers might further reduce the need for any fine-tuning.
If the velocity adjustment proves robust, it could be applied at inference time to other chunked diffusion pipelines without retraining.

Load-bearing premise

Maintaining a persistent latent context memory across chunks together with velocity adjustment during sampling is enough to stop both quality and semantic drift without creating fresh inconsistencies.

What would settle it

Generate a 3-minute animation with the method and measure whether background PSNR or character identity metrics begin to decline after the 90-second mark.

Figures

Figures reproduced from arXiv: 2605.15042 by Alexandre Alahi, Lan Feng, Mariam Hassan, Po-Chien Luan, Wentao Pan, Wuyang Li, Yang Gao.

**Figure 1.** Figure 1: (a) Existing human animation methods primarily suffer from two types of drift: low-level quality degradation and high-level identity change. (b) Our method alleviates both issues, achieving stable animation. The bottom row provides zoomed-in views of the facial region and the background. with commonly adopted anti-drifting methods, such as attention sinks [15] 1 , error recycling [16], and sliding-window [… view at source ↗

**Figure 2.** Figure 2: Illustration of errors in long-range animation videos. We visualize (a) VAE round-trip reconstructions and (b) DiT self-attention maps across video chunks at different lengths. In each attention map, the 1st col. (highlighted in red) shows how video tokens attend to the global reference frame. Although most tokens correctly attend to this reference (i.e., forming an attention sink [15, 46]), the generated … view at source ↗

**Figure 3.** Figure 3: Overview of EverAnimate. Train: from a context chunk V (1), we extract motion/identity memories Mid/mot and train the model to generate the next chunk V (2) with restorative flow matching. Test: We roll out chunk-by-chunk in the latent space without decoding frames between chunks. current generation. (a) Persistent Latent Propagation constructs motion and identity memory that will be propagated from V (1) … view at source ↗

**Figure 4.** Figure 4: Effects of memory augmentation Tid. Memory and Control Injection. We inject the memories and controls into the DiT input in two steps. First, we form the context tokens by concatenating motion/identity memories (plus a null pad to match temporal length). For Wan-style backbones, we build up the full memory as follows, Mctx = Concatt(Mmot, Mid, Xpad), (2) where Xpad is a null latent block so that Mctx has … view at source ↗

**Figure 5.** Figure 5: RFM illustration. Comparison between our RFM and the standard FM baseline. Given the memory-anchored input, we train the denoising flow not only to follow the clean flow trajectory but also to recover from small intratrajectory deviations during rollout. Flow Matching (FM). We first recall the standard FM formulation (Fig. 5a) for the target chunk V (2). Since the chunk index is fixed in this subsection,… view at source ↗

**Figure 6.** Figure 6: Effects of the reschedule. Comparison of the training stability with and without rescheduling λ(t), i.e., Eq. (9) vs. Eq. (11). Our rescheduling design, shown in orange, can stabilize training. Eq. (9) shows that RFM can be written as the standard FM velocity plus a correction that pulls the perturbed state back toward the clean path. However, we find that the exact coefficient 1 1−t is poorly conditione… view at source ↗

**Figure 7.** Figure 7: Qualitative comparison with state-of-the-art methods. See supplementary material for [PITH_FULL_IMAGE:figures/full_fig_p008_7.png] view at source ↗

read the original abstract

We propose EverAnimate, an efficient post-training method for long-horizon animated video generation that preserves visual quality and character identity. Long-form animation remains challenging because highly dynamic human motion must be synthesized against relatively static environments, making chunk-based generation prone to accumulated drift: (i) low-level quality drift, such as progressive degradation of static backgrounds, and (ii) high-level semantic drift, such as inconsistent character identity and view-dependent attributes. To address this issue, EverAnimate restores drifted flow trajectories by anchoring generation to a persistent latent context memory, consisting of two complementary mechanisms. (i) Persistent Latent Propagation maintains a context memory across chunks to propagate identity and motion in latent space while mitigating temporal forgetting. (ii) Restorative Flow Matching introduces an implicit restoration objective during sampling through velocity adjustment, improving within-chunk fidelity. With only lightweight LoRA tuning, EverAnimate outperforms state-of-the-art long-animation methods in both short- and long-horizon settings: at 10 seconds, it improves PSNR/SSIM by 8%/7% and reduces LPIPS/FID by 22%/11%; at 90 seconds, the gains increase to 15%/15% and 32%/27%, respectively.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

EverAnimate pairs persistent latent memory with flow velocity tweaks to cut drift in chunked human video gen, and the longer-horizon gains look plausible on the reported metrics.

read the letter

The paper's main contribution is a post-training recipe that keeps a running latent context across generation chunks while adjusting sampling velocities to pull drifted trajectories back on track. They apply this with lightweight LoRA updates and show metric lifts that widen from 10-second clips to 90-second ones. The abstract frames the two pieces—persistent propagation and restorative matching—as a joint fix for both pixel-level degradation and identity/view drift in human motion against static backgrounds.

What the work does cleanly is target a real deployment pain point: existing chunked pipelines lose coherence fast on minute-scale output. Reporting that the advantage grows with length is useful, and the claim that only LoRA is needed keeps the bar low for adoption.

The soft spots are the usual ones for an abstract-heavy submission. No ablations appear that isolate whether the memory or the velocity adjustment drives most of the lift, and there is no discussion of whether the persistent context itself can seed new inconsistencies at longer scales. The quantitative numbers are given without controls for seed variance, dataset splits, or alternative baselines, so the size of the real effect is still unclear. The full manuscript may fill these gaps, but they are not visible here.

This is for groups already working on long video synthesis who need a drop-in drift mitigator rather than a full model redesign. If the experiments in the paper include the missing checks and the method stays stable under different random seeds, it is worth a serious referee read. I would send it out for review.

Referee Report

2 major / 0 minor

Summary. The paper proposes EverAnimate, a post-training method for minute-scale human video animation that uses Persistent Latent Propagation (maintaining context memory across chunks) and Restorative Flow Matching (velocity adjustment during sampling) with lightweight LoRA tuning to mitigate low-level quality drift and high-level semantic drift. It claims consistent outperformance over SOTA long-animation methods, with metric gains (PSNR/SSIM up 8%/7%, LPIPS/FID down 22%/11% at 10s; larger gains of 15%/15% and 32%/27% at 90s).

Significance. If the reported gains are robust, the work would offer a practical, low-cost route to long-horizon animation without full retraining, addressing a recognized bottleneck in chunked video generation. The combination of persistent latent memory and implicit restoration during sampling is a plausible direction, though its independence from the base model parameters remains to be demonstrated.

major comments (2)

[Abstract] Abstract: the central quantitative claim (specific PSNR/SSIM/LPIPS/FID deltas at 10 s and 90 s) is presented without any description of dataset composition, number of test sequences, random-seed statistics, or baseline re-implementations, rendering the outperformance claim unverifiable from the supplied material.
[Abstract] Abstract / Methods: no equations, pseudocode, or ablation isolating Persistent Latent Propagation versus Restorative Flow Matching are supplied, so it is impossible to determine whether the two mechanisms are independent or whether the velocity adjustment reduces to quantities already fitted by the base model.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the constructive feedback. We address each major comment below and will revise the manuscript accordingly to improve verifiability and technical detail.

read point-by-point responses

Referee: [Abstract] Abstract: the central quantitative claim (specific PSNR/SSIM/LPIPS/FID deltas at 10 s and 90 s) is presented without any description of dataset composition, number of test sequences, random-seed statistics, or baseline re-implementations, rendering the outperformance claim unverifiable from the supplied material.

Authors: We agree that the abstract should include sufficient context to support the quantitative claims. In the revision we will expand the abstract with a brief description of the evaluation dataset, number of test sequences, statistical protocol (including random seeds), and confirmation that baselines were re-implemented from their original publications, while retaining full details in the Experiments section. revision: yes
Referee: [Abstract] Abstract / Methods: no equations, pseudocode, or ablation isolating Persistent Latent Propagation versus Restorative Flow Matching are supplied, so it is impossible to determine whether the two mechanisms are independent or whether the velocity adjustment reduces to quantities already fitted by the base model.

Authors: The current manuscript provides textual descriptions of the two mechanisms but lacks formal equations, pseudocode, and isolating ablations. We will add the mathematical formulations for Persistent Latent Propagation and Restorative Flow Matching, include pseudocode for the overall procedure, and insert an ablation study that isolates the contribution of each component. This will demonstrate that the velocity adjustment introduces an additional term during sampling that is independent of the base model. revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper presents an empirical post-training method (Persistent Latent Propagation + Restorative Flow Matching with lightweight LoRA) whose performance claims are supported by reported metric improvements on short- and long-horizon video generation tasks. No equations, derivations, or first-principles results are supplied that reduce the claimed gains to quantities defined by the method's own fitted parameters or by self-citation chains. The central mechanisms are described as independent additions to an existing generation pipeline rather than tautological redefinitions of the evaluation targets. Because the manuscript contains no load-bearing self-citations, fitted-input predictions, or ansatz smuggling, the derivation chain is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based solely on the abstract, no explicit free parameters, background axioms, or newly postulated physical entities are described; the method relies on standard latent diffusion concepts and LoRA adaptation whose specific hyper-parameters are not enumerated.

pith-pipeline@v0.9.1-grok · 5760 in / 1209 out tokens · 28161 ms · 2026-06-30T21:19:50.952475+00:00 · methodology

discussion (0)

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Drift-Resistant Navigation World Model with Anchored Epipolar Guidance
cs.CV 2026-05 unverdicted novelty 5.0

A generative navigation world model that uses sparse anchored rollout with epipolar constraints to reduce perceptual and geometric drift.

Reference graph

Works this paper leans on

57 extracted references · 35 canonical work pages · cited by 1 Pith paper · 15 internal anchors

[2]

HunyuanVideo: A Systematic Framework For Large Video Generative Models

Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, Kathrina Wu, Qin Lin, Junkun Yuan, Yanxin Long, Aladdin Wang, Andong Wang, Changlin Li, Duojun Huang, Fang Yang, Hao Tan, Hongmei Wang, Jacob Song, Jiawang Bai, Jianbing Wu, Jinbao Xue, Joey Wang, Kai Wang, Mengyang Liu, Pengyu Li, Shuai Li, ...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[3]

Everybody dance now

Caroline Chan, Shiry Ginosar, Tinghui Zhou, and Alexei A Efros. Everybody dance now. In Proceedings of the IEEE/CVF international conference on computer vision, pages 5933–5942, 2019

2019
[4]

Video-to-Video Synthesis

Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Guilin Liu, Andrew Tao, Jan Kautz, and Bryan Catanzaro. Video-to-video synthesis.arXiv preprint arXiv:1808.06601, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[5]

AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning

Yuwei Guo, Ceyuan Yang, Anyi Rao, Zhengyang Liang, Yaohui Wang, Yu Qiao, Maneesh Agrawala, Dahua Lin, and Bo Dai. Animatediff: Animate your personalized text-to-image diffusion models without specific tuning.arXiv preprint arXiv:2307.04725, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[6]

Magicanimate: Temporally consistent human image animation using diffusion model

Zhongcong Xu, Jianfeng Zhang, Jun Hao Liew, Hanshu Yan, Jia-Wei Liu, Chenxu Zhang, Jiashi Feng, and Mike Zheng Shou. Magicanimate: Temporally consistent human image animation using diffusion model. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1481–1490, 2024

2024
[7]

Animate anyone: Consistent and controllable image-to-video synthesis for character animation.arXiv preprint arXiv:2311.17117, 2023

Li Hu, Xin Gao, Peng Zhang, Ke Sun, Bang Zhang, and Liefeng Bo. Animate anyone: Consistent and controllable image-to-video synthesis for character animation.arXiv preprint arXiv:2311.17117, 2023

arXiv 2023
[8]

Tcan: Animating human images with temporally consistent pose guidance using diffusion models.arXiv preprint arXiv:2407.09012, 2024

Jeongho Kim, Min-Jung Kim, Junsoo Lee, and Jaegul Choo. Tcan: Animating human images with temporally consistent pose guidance using diffusion models.arXiv preprint arXiv:2407.09012, 2024

work page arXiv 2024
[9]

Posecrafter: One-shot personalized video synthesis following flexible pose control

Yong Zhong, Min Zhao, Zebin You, Xiaofeng Yu, Changwang Zhang, and Chongxuan Li. Posecrafter: One-shot personalized video synthesis following flexible pose control. InEuropean conference on computer vision, pages 243–260. Springer, 2024

2024
[10]

Humandit: Pose-guided diffusion transformer for long-form human motion video generation.arXiv preprint arXiv:2502.04847, 2025

Qijun Gan, Yi Ren, Chen Zhang, Zhenhui Ye, Pan Xie, Xiang Yin, Zehuan Yuan, Bingyue Peng, and Jianke Zhu. Humandit: Pose-guided diffusion transformer for long-form human motion video generation.arXiv preprint arXiv:2502.04847, 2025. 10

work page arXiv 2025
[11]

One-to-all animation: Alignment-free character animation and image pose transfer

Shijun Shi, Jing Xu, Zhihang Li, Chunli Peng, Xiaoda Yang, Lijing Lu, Kai Hu, and Jiangning Zhang. One-to-all animation: Alignment-free character animation and image pose transfer. arXiv preprint arXiv:2511.22940, 2025

Pith/arXiv arXiv 2025
[12]

SteadyDancer: Harmonized and Coherent Human Image Animation with First-Frame Preservation

Jiaming Zhang, Shengming Cao, Rui Li, Xiaotong Zhao, Yutao Cui, Xinglin Hou, Gangshan Wu, Haolan Chen, Yu Xu, Limin Wang, and Kai Ma. Steadydancer: Harmonized and coherent human image animation with first-frame preservation.arXiv preprint arXiv:2511.19320, 2025. URLhttps://arxiv.org/abs/2511.19320

work page internal anchor Pith review Pith/arXiv arXiv 2025
[13]

Animate-x: Universal character image animation with enhanced motion representation.arXiv preprint arXiv:2410.10306, 2024

Shuai Tan, Biao Gong, Xiang Wang, Shiwei Zhang, Dandan Zheng, Ruobing Zheng, Kecheng Zheng, Jingdong Chen, and Ming Yang. Animate-x: Universal character image animation with enhanced motion representation.arXiv preprint arXiv:2410.10306, 2024

work page arXiv 2024
[15]

Efficient streaming language models with attention sinks.arXiv preprint arXiv:2309.17453, 2023

Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks.arXiv preprint arXiv:2309.17453, 2023

Pith/arXiv arXiv 2023
[16]

Stable video infinity: Infinite-length video generation with error recycling.arXiv preprint arXiv:2510.09212, 2025

Wuyang Li, Wentao Pan, Po-Chien Luan, Yang Gao, and Alexandre Alahi. Stable video infinity: Infinite-length video generation with error recycling.arXiv preprint arXiv:2510.09212, 2025

arXiv 2025
[17]

First order motion model for image animation.Advances in neural information processing systems, 32, 2019

Aliaksandr Siarohin, Stéphane Lathuilière, Sergey Tulyakov, Elisa Ricci, and Nicu Sebe. First order motion model for image animation.Advances in neural information processing systems, 32, 2019

2019
[19]

Mimicmotion: High-quality human motion video generation with confidence-aware pose guidance.arXiv preprint arXiv:2406.19680, 2024

Yuang Zhang, Jiaxi Gu, Li-Wen Wang, Han Wang, Junqi Cheng, Yuefeng Zhu, and Fangyuan Zou. Mimicmotion: High-quality human motion video generation with confidence-aware pose guidance.arXiv preprint arXiv:2406.19680, 2024

arXiv 2024
[20]

Unianimate-dit: Human image animation with large-scale video diffusion transformer.arXiv preprint arXiv:2504.11289, 2025

Xiang Wang, Shiwei Zhang, Longxiang Tang, Yingya Zhang, Changxin Gao, Yuehuan Wang, and Nong Sang. Unianimate-dit: Human image animation with large-scale video diffusion transformer.arXiv preprint arXiv:2504.11289, 2025

work page arXiv 2025
[21]

RealisDance-DiT: Simple yet strong baseline towards controllable character animation in the wild.arXiv preprint arXiv:2504.14977, 2025

Jingkai Zhou, Yifan Wu, Shikai Li, Min Wei, Chao Fan, Weihua Chen, Wei Jiang, and Fan Wang. RealisDance-DiT: Simple yet strong baseline towards controllable character animation in the wild.arXiv preprint arXiv:2504.14977, 2025

work page arXiv 2025
[22]

StableAnimator: High-quality identity-preserving human image animation

Shuyuan Tu, Zhen Xing, Xintong Han, Zhi-Qi Cheng, Qi Dai, Chong Luo, and Zuxuan Wu. StableAnimator: High-quality identity-preserving human image animation. InCVPR, 2025

2025
[23]

Wan- Animate: Unified character animation and replacement with holistic replication.arXiv preprint arXiv:2509.14055, 2025

Gang Cheng, Xin Gao, Li Hu, Siqi Hu, Mingyang Huang, Chaonan Ji, et al. Wan- Animate: Unified character animation and replacement with holistic replication.arXiv preprint arXiv:2509.14055, 2025

work page arXiv 2025
[24]

Scail: Towards studio-grade character animation via in-context learning of 3d-consistent pose representations.arXiv preprint arXiv:2512.05905, 2025

Wenhao Yan, Sheng Ye, Zhuoyi Yang, Jiayan Teng, ZhenHui Dong, Kairui Wen, Xiaotao Gu, Yong-Jin Liu, and Jie Tang. Scail: Towards studio-grade character animation via in-context learning of 3d-consistent pose representations.arXiv preprint arXiv:2512.05905, 2025

work page arXiv 2025
[25]

Videox-fun: A video generation pipeline for diffusion transformer, 2026

aigc apps. Videox-fun: A video generation pipeline for diffusion transformer, 2026. URL https://github.com/aigc-apps/VideoX-Fun

2026
[26]

Deformable gaussian occupancy: Decoupling rigid and nonrigid motion with factorized distillation

Yang Gao, Wuyang Li, Po-Chien Luan, and Alexandre Alahi. Deformable gaussian occupancy: Decoupling rigid and nonrigid motion with factorized distillation. InCVPR, 2026

2026
[27]

Echomimicv2: Towards striking, simplified, and semi-body human animation.arXiv preprint arXiv:2411.10061, 2024

Rang Meng, Xingyu Zhang, Yuming Li, and Chenguang Ma. Echomimicv2: Towards striking, simplified, and semi-body human animation.arXiv preprint arXiv:2411.10061, 2024. URL https://arxiv.org/abs/2411.10061. 11

work page arXiv 2024
[28]

Omnihuman-1: Rethinking the scaling-up of one-stage conditioned human animation models.arXiv preprint arXiv:2502.01061, 2025

Gaojie Lin, Jianwen Jiang, Jiaqi Yang, Zerong Zheng, and Chao Liang. Omnihuman-1: Rethinking the scaling-up of one-stage conditioned human animation models.arXiv preprint arXiv:2502.01061, 2025

work page arXiv 2025
[29]

HunyuanVideo-Avatar: High-fidelity audio-driven human animation for multiple characters.arXiv preprint arXiv:2505.20156, 2025

Yi Chen, Sen Liang, Zixiang Zhou, Ziyao Huang, Yifeng Ma, Junshu Tang, Qin Lin, Yuan Zhou, and Qinglin Lu. HunyuanVideo-Avatar: High-fidelity audio-driven human animation for multiple characters.arXiv preprint arXiv:2505.20156, 2025

work page arXiv 2025
[30]

PoseGen: In-Context LoRA Finetuning for Pose-Controllable Long Human Video Generation

Jingxuan He, Busheng Su, and Finn Wong. PoseGen: In-context LoRA finetuning for pose- controllable long human video generation.arXiv preprint arXiv:2508.05091, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[31]

High-fidelity and long-duration human image animation with diffusion transformer.arXiv preprint arXiv:2512.21905, 2025

Shen Zheng, Jiaran Cai, Yuansheng Guan, Shenneng Huang, Xingpei Ma, Junjie Cao, Hanfeng Zhao, Qiang Zhang, Shunsi Zhang, and Xiao-Ping Zhang. High-fidelity and long-duration human image animation with diffusion transformer.arXiv preprint arXiv:2512.21905, 2025

work page arXiv 2025
[32]

Lookahead anchoring: Preserving character identity in audio-driven human animation.arXiv preprint arXiv:2510.23581, 2025

Junyoung Seo, Rodrigo Mira, Alexandros Haliassos, Stella Bounareli, Honglie Chen, Linh Tran, Seungryong Kim, Zoe Landgraf, and Jie Shen. Lookahead anchoring: Preserving character identity in audio-driven human animation.arXiv preprint arXiv:2510.23581, 2025

work page arXiv 2025
[33]

AnimateAnywhere: Rouse the background in human image animation

Xiaoyu Liu, Mingshuai Yao, Yabo Zhang, Xianhui Lin, Peiran Ren, Xiaoming Li, Ming Liu, and Wangmeng Zuo. AnimateAnywhere: Rouse the background in human image animation. arXiv preprint arXiv:2504.19834, 2025

work page arXiv 2025
[34]

Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram V oleti, Adam Letts, Varun Jampani, and Robin Rombach. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[35]

CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer.arXiv preprint arXiv:2408.06072, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[36]

Wan: Open and Advanced Large-Scale Video Generative Models

Wan Team. Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[37]

Movie Gen: A Cast of Media Foundation Models

Adam Polyak, Amit Zohar, Andrew Brown, Andros Tjandra, Animesh Sinha, Ann Lee, Apoorv Vyas, Bowen Shi, Chih-Yao Ma, Ching-Yao Chuang, et al. Movie gen: A cast of media foundation models.arXiv preprint arXiv:2410.13720, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[38]

MAGI-1: Autoregressive Video Generation at Scale

Sand.ai. MAGI-1: Autoregressive video generation at scale.arXiv preprint arXiv:2505.13211, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[39]

SkyReels-V2: Infinite-length Film Generative Model

Guibin Chen, Dixuan Lin, Jiangping Yang, Chunze Lin, Junchen Zhu, Mingyuan Fan, et al. SkyReels-V2: Infinite-length film generative model.arXiv preprint arXiv:2504.13074, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[41]

URLhttps://arxiv.org/abs/2406.16863

work page arXiv
[42]

Social-mamba: Efficient human trajectory forecasting with state-space models

Po-Chien Luan, Wuyang Li, Yang Gao, and Alexandre Alahi. Social-mamba: Efficient human trajectory forecasting with state-space models. 2025

2025
[43]

Diffusion forcing: Next-token prediction meets full-sequence diffusion

Boyuan Chen, Diego Marti Monso, Yilun Du, Max Simchowitz, Russ Tedrake, and Vincent Sitzmann. Diffusion forcing: Next-token prediction meets full-sequence diffusion. InNeurIPS, 2024

2024
[44]

Freeman, Frédo Durand, Eli Shechtman, and Xun Huang

Tianwei Yin, Qiang Zhang, Richard Zhang, William T. Freeman, Frédo Durand, Eli Shechtman, and Xun Huang. From slow bidirectional to fast autoregressive video diffusion models. In CVPR, 2025

2025
[45]

Self forcing: Bridging the train-test gap in autoregressive video diffusion

Xun Huang, Zhengqi Li, Guande He, Mingyuan Zhou, and Eli Shechtman. Self forcing: Bridging the train-test gap in autoregressive video diffusion. InNeurIPS, 2025

2025
[46]

Rolling forcing: Autoregressive long video diffusion in real time

Kunhao Liu, Wenbo Hu, Jiale Xu, Ying Shan, and Shijian Lu. Rolling forcing: Autoregressive long video diffusion in real time. InICLR, 2026. 12

2026
[47]

LongLive: Real-time Interactive Long Video Generation

Shuai Yang, Wei Huang, Ruihang Chu, Yicheng Xiao, Yuyang Zhao, Xianbang Wang, Muyang Li, Enze Xie, Yingcong Chen, Yao Lu, Song Han, and Yukang Chen. Longlive: Real-time interactive long video generation.arXiv preprint arXiv:2509.22622, 2025. URL https://arxiv.org/abs/2509.22622

work page internal anchor Pith review Pith/arXiv arXiv 2025
[48]

Packing input frame context in next-frame prediction models for video generation

Lvmin Zhang, Shengqu Cai, Muyang Li, Gordon Wetzstein, and Maneesh Agrawala. Packing input frame context in next-frame prediction models for video generation. InNeurIPS, 2025

2025
[49]

Stable video infinity: Infinite-length video generation with error recycling

Wuyang Li, Wentao Pan, Po-Chien Luan, Yang Gao, and Alexandre Alahi. Stable video infinity: Infinite-length video generation with error recycling. InICLR, 2026

2026
[50]

Helios: Real real-time long video generation model.arXiv preprint arXiv:2603.04379, 2026

Shenghai Yuan, Yuanyang Yin, Zongjian Li, Xinwei Huang, Xiao Yang, and Li Yuan. Helios: Real real-time long video generation model.arXiv preprint arXiv:2603.04379, 2026

work page arXiv 2026
[51]

Matrix-Game 3.0: Real-Time and Streaming Interactive World Model with Long-Horizon Memory

Zile Wang, Zexiang Liu, Jaixing Li, Kaichen Huang, Baixin Xu, Fei Kang, Mengyin An, Peiyu Wang, Biao Jiang, Yichen Wei, et al. Matrix-game 3.0: Real-time and streaming interactive world model with long-horizon memory.arXiv preprint arXiv:2604.08995, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[52]

LongCat-Video technical report.arXiv preprint arXiv:2510.22200, 2025

Meituan LongCat Team. LongCat-Video technical report.arXiv preprint arXiv:2510.22200, 2025

work page arXiv 2025
[53]

Malt diffusion: Memory-augmented latent transformers for any-length video generation.arXiv preprint arXiv:2502.12632, 2025

Sihyun Yu, Meera Hahn, Dan Kondratyuk, Jinwoo Shin, Agrim Gupta, Jose Lezama, Irfan Essa, David Ross, and Jonathan Huang. Malt diffusion: Memory-augmented latent transformers for any-length video generation.arXiv preprint arXiv:2502.12632, 2025

work page arXiv 2025
[54]

TinyHistory: Lightweight Video History Embeddings via Two-Stage Context Learning

Lvmin Zhang, Shengqu Cai, Muyang Li, Chong Zeng, Beijia Lu, Anyi Rao, Song Han, Gordon Wetzstein, and Maneesh Agrawala. Pretraining frame preservation in autoregressive video memory compression.arXiv preprint arXiv:2512.23851, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[55]

WorldMem: Long-term consistent world simulation with memory.arXiv preprint arXiv:2504.12369, 2025

Zeqi Xiao, Yushi Lan, Yifan Zhou, Wenqi Ouyang, Shuai Yang, Yanhong Zeng, and Xingang Pan. WorldMem: Long-term consistent world simulation with memory.arXiv preprint arXiv:2504.12369, 2025

work page arXiv 2025
[56]

Context forcing: Consistent autoregressive video generation with long context,

Shuo Chen, Cong Wei, Sun Sun, Ping Nie, Kai Zhou, Ge Zhang, Ming-Hsuan Yang, and Wenhu Chen. Context forcing: Consistent autoregressive video generation with long context.arXiv preprint arXiv:2602.06028, 2026

work page arXiv 2026
[57]

Diffsynth-studio

ModelScope Team. Diffsynth-studio. GitHub repository, 2024. URL https://github.com/ modelscope/DiffSynth-Studio. Accessed: 2026-05-04

2024
[58]

Champ: Controllable and consistent human image animation with 3d parametric guidance

Shenhao Zhu, Junming Leo Chen, Zuozhuo Dai, Zilong Dong, Yinghui Xu, Xun Cao, Yao Yao, Hao Zhu, and Siyu Zhu. Champ: Controllable and consistent human image animation with 3d parametric guidance. InEuropean Conference on Computer Vision, pages 145–162. Springer, 2024

2024
[59]

Dwnet: Dense warp-based network for pose-guided human video generation.arXiv preprint arXiv:1910.09139, 2019

Polina Zablotskaia, Aliaksandr Siarohin, Bo Zhao, and Leonid Sigal. Dwnet: Dense warp-based network for pose-guided human video generation.arXiv preprint arXiv:1910.09139, 2019

work page arXiv 1910
[60]

Seedance 1.0: Exploring the Boundaries of Video Generation Models

Yu Gao, Haoyuan Guo, Tuyen Hoang, Weilin Huang, Lu Jiang, Fangyuan Kong, Huixia Li, Jiashi Li, Liang Li, Xiaojie Li, et al. Seedance 1.0: Exploring the boundaries of video generation models.arXiv preprint arXiv:2506.09113, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[61]

Videomae: Masked autoencoders are data-efficient learners for self-supervised video pre-training

Zhan Tong, Yibing Song, Jue Wang, and Limin Wang. Videomae: Masked autoencoders are data-efficient learners for self-supervised video pre-training. InAdvances in Neural Information Processing Systems, 2022. 13

2022

[1] [2]

HunyuanVideo: A Systematic Framework For Large Video Generative Models

Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, Kathrina Wu, Qin Lin, Junkun Yuan, Yanxin Long, Aladdin Wang, Andong Wang, Changlin Li, Duojun Huang, Fang Yang, Hao Tan, Hongmei Wang, Jacob Song, Jiawang Bai, Jianbing Wu, Jinbao Xue, Joey Wang, Kai Wang, Mengyang Liu, Pengyu Li, Shuai Li, ...

work page internal anchor Pith review Pith/arXiv arXiv 2024

[2] [3]

Everybody dance now

Caroline Chan, Shiry Ginosar, Tinghui Zhou, and Alexei A Efros. Everybody dance now. In Proceedings of the IEEE/CVF international conference on computer vision, pages 5933–5942, 2019

2019

[3] [4]

Video-to-Video Synthesis

Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Guilin Liu, Andrew Tao, Jan Kautz, and Bryan Catanzaro. Video-to-video synthesis.arXiv preprint arXiv:1808.06601, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[4] [5]

AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning

Yuwei Guo, Ceyuan Yang, Anyi Rao, Zhengyang Liang, Yaohui Wang, Yu Qiao, Maneesh Agrawala, Dahua Lin, and Bo Dai. Animatediff: Animate your personalized text-to-image diffusion models without specific tuning.arXiv preprint arXiv:2307.04725, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[5] [6]

Magicanimate: Temporally consistent human image animation using diffusion model

Zhongcong Xu, Jianfeng Zhang, Jun Hao Liew, Hanshu Yan, Jia-Wei Liu, Chenxu Zhang, Jiashi Feng, and Mike Zheng Shou. Magicanimate: Temporally consistent human image animation using diffusion model. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1481–1490, 2024

2024

[6] [7]

Animate anyone: Consistent and controllable image-to-video synthesis for character animation.arXiv preprint arXiv:2311.17117, 2023

Li Hu, Xin Gao, Peng Zhang, Ke Sun, Bang Zhang, and Liefeng Bo. Animate anyone: Consistent and controllable image-to-video synthesis for character animation.arXiv preprint arXiv:2311.17117, 2023

arXiv 2023

[7] [8]

Tcan: Animating human images with temporally consistent pose guidance using diffusion models.arXiv preprint arXiv:2407.09012, 2024

Jeongho Kim, Min-Jung Kim, Junsoo Lee, and Jaegul Choo. Tcan: Animating human images with temporally consistent pose guidance using diffusion models.arXiv preprint arXiv:2407.09012, 2024

work page arXiv 2024

[8] [9]

Posecrafter: One-shot personalized video synthesis following flexible pose control

Yong Zhong, Min Zhao, Zebin You, Xiaofeng Yu, Changwang Zhang, and Chongxuan Li. Posecrafter: One-shot personalized video synthesis following flexible pose control. InEuropean conference on computer vision, pages 243–260. Springer, 2024

2024

[9] [10]

Humandit: Pose-guided diffusion transformer for long-form human motion video generation.arXiv preprint arXiv:2502.04847, 2025

Qijun Gan, Yi Ren, Chen Zhang, Zhenhui Ye, Pan Xie, Xiang Yin, Zehuan Yuan, Bingyue Peng, and Jianke Zhu. Humandit: Pose-guided diffusion transformer for long-form human motion video generation.arXiv preprint arXiv:2502.04847, 2025. 10

work page arXiv 2025

[10] [11]

One-to-all animation: Alignment-free character animation and image pose transfer

Shijun Shi, Jing Xu, Zhihang Li, Chunli Peng, Xiaoda Yang, Lijing Lu, Kai Hu, and Jiangning Zhang. One-to-all animation: Alignment-free character animation and image pose transfer. arXiv preprint arXiv:2511.22940, 2025

Pith/arXiv arXiv 2025

[11] [12]

SteadyDancer: Harmonized and Coherent Human Image Animation with First-Frame Preservation

Jiaming Zhang, Shengming Cao, Rui Li, Xiaotong Zhao, Yutao Cui, Xinglin Hou, Gangshan Wu, Haolan Chen, Yu Xu, Limin Wang, and Kai Ma. Steadydancer: Harmonized and coherent human image animation with first-frame preservation.arXiv preprint arXiv:2511.19320, 2025. URLhttps://arxiv.org/abs/2511.19320

work page internal anchor Pith review Pith/arXiv arXiv 2025

[12] [13]

Animate-x: Universal character image animation with enhanced motion representation.arXiv preprint arXiv:2410.10306, 2024

Shuai Tan, Biao Gong, Xiang Wang, Shiwei Zhang, Dandan Zheng, Ruobing Zheng, Kecheng Zheng, Jingdong Chen, and Ming Yang. Animate-x: Universal character image animation with enhanced motion representation.arXiv preprint arXiv:2410.10306, 2024

work page arXiv 2024

[13] [15]

Efficient streaming language models with attention sinks.arXiv preprint arXiv:2309.17453, 2023

Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks.arXiv preprint arXiv:2309.17453, 2023

Pith/arXiv arXiv 2023

[14] [16]

Stable video infinity: Infinite-length video generation with error recycling.arXiv preprint arXiv:2510.09212, 2025

Wuyang Li, Wentao Pan, Po-Chien Luan, Yang Gao, and Alexandre Alahi. Stable video infinity: Infinite-length video generation with error recycling.arXiv preprint arXiv:2510.09212, 2025

arXiv 2025

[15] [17]

First order motion model for image animation.Advances in neural information processing systems, 32, 2019

Aliaksandr Siarohin, Stéphane Lathuilière, Sergey Tulyakov, Elisa Ricci, and Nicu Sebe. First order motion model for image animation.Advances in neural information processing systems, 32, 2019

2019

[16] [19]

Mimicmotion: High-quality human motion video generation with confidence-aware pose guidance.arXiv preprint arXiv:2406.19680, 2024

Yuang Zhang, Jiaxi Gu, Li-Wen Wang, Han Wang, Junqi Cheng, Yuefeng Zhu, and Fangyuan Zou. Mimicmotion: High-quality human motion video generation with confidence-aware pose guidance.arXiv preprint arXiv:2406.19680, 2024

arXiv 2024

[17] [20]

Unianimate-dit: Human image animation with large-scale video diffusion transformer.arXiv preprint arXiv:2504.11289, 2025

Xiang Wang, Shiwei Zhang, Longxiang Tang, Yingya Zhang, Changxin Gao, Yuehuan Wang, and Nong Sang. Unianimate-dit: Human image animation with large-scale video diffusion transformer.arXiv preprint arXiv:2504.11289, 2025

work page arXiv 2025

[18] [21]

RealisDance-DiT: Simple yet strong baseline towards controllable character animation in the wild.arXiv preprint arXiv:2504.14977, 2025

Jingkai Zhou, Yifan Wu, Shikai Li, Min Wei, Chao Fan, Weihua Chen, Wei Jiang, and Fan Wang. RealisDance-DiT: Simple yet strong baseline towards controllable character animation in the wild.arXiv preprint arXiv:2504.14977, 2025

work page arXiv 2025

[19] [22]

StableAnimator: High-quality identity-preserving human image animation

Shuyuan Tu, Zhen Xing, Xintong Han, Zhi-Qi Cheng, Qi Dai, Chong Luo, and Zuxuan Wu. StableAnimator: High-quality identity-preserving human image animation. InCVPR, 2025

2025

[20] [23]

Wan- Animate: Unified character animation and replacement with holistic replication.arXiv preprint arXiv:2509.14055, 2025

Gang Cheng, Xin Gao, Li Hu, Siqi Hu, Mingyang Huang, Chaonan Ji, et al. Wan- Animate: Unified character animation and replacement with holistic replication.arXiv preprint arXiv:2509.14055, 2025

work page arXiv 2025

[21] [24]

Scail: Towards studio-grade character animation via in-context learning of 3d-consistent pose representations.arXiv preprint arXiv:2512.05905, 2025

Wenhao Yan, Sheng Ye, Zhuoyi Yang, Jiayan Teng, ZhenHui Dong, Kairui Wen, Xiaotao Gu, Yong-Jin Liu, and Jie Tang. Scail: Towards studio-grade character animation via in-context learning of 3d-consistent pose representations.arXiv preprint arXiv:2512.05905, 2025

work page arXiv 2025

[22] [25]

Videox-fun: A video generation pipeline for diffusion transformer, 2026

aigc apps. Videox-fun: A video generation pipeline for diffusion transformer, 2026. URL https://github.com/aigc-apps/VideoX-Fun

2026

[23] [26]

Deformable gaussian occupancy: Decoupling rigid and nonrigid motion with factorized distillation

Yang Gao, Wuyang Li, Po-Chien Luan, and Alexandre Alahi. Deformable gaussian occupancy: Decoupling rigid and nonrigid motion with factorized distillation. InCVPR, 2026

2026

[24] [27]

Echomimicv2: Towards striking, simplified, and semi-body human animation.arXiv preprint arXiv:2411.10061, 2024

Rang Meng, Xingyu Zhang, Yuming Li, and Chenguang Ma. Echomimicv2: Towards striking, simplified, and semi-body human animation.arXiv preprint arXiv:2411.10061, 2024. URL https://arxiv.org/abs/2411.10061. 11

work page arXiv 2024

[25] [28]

Omnihuman-1: Rethinking the scaling-up of one-stage conditioned human animation models.arXiv preprint arXiv:2502.01061, 2025

Gaojie Lin, Jianwen Jiang, Jiaqi Yang, Zerong Zheng, and Chao Liang. Omnihuman-1: Rethinking the scaling-up of one-stage conditioned human animation models.arXiv preprint arXiv:2502.01061, 2025

work page arXiv 2025

[26] [29]

HunyuanVideo-Avatar: High-fidelity audio-driven human animation for multiple characters.arXiv preprint arXiv:2505.20156, 2025

Yi Chen, Sen Liang, Zixiang Zhou, Ziyao Huang, Yifeng Ma, Junshu Tang, Qin Lin, Yuan Zhou, and Qinglin Lu. HunyuanVideo-Avatar: High-fidelity audio-driven human animation for multiple characters.arXiv preprint arXiv:2505.20156, 2025

work page arXiv 2025

[27] [30]

PoseGen: In-Context LoRA Finetuning for Pose-Controllable Long Human Video Generation

Jingxuan He, Busheng Su, and Finn Wong. PoseGen: In-context LoRA finetuning for pose- controllable long human video generation.arXiv preprint arXiv:2508.05091, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[28] [31]

High-fidelity and long-duration human image animation with diffusion transformer.arXiv preprint arXiv:2512.21905, 2025

Shen Zheng, Jiaran Cai, Yuansheng Guan, Shenneng Huang, Xingpei Ma, Junjie Cao, Hanfeng Zhao, Qiang Zhang, Shunsi Zhang, and Xiao-Ping Zhang. High-fidelity and long-duration human image animation with diffusion transformer.arXiv preprint arXiv:2512.21905, 2025

work page arXiv 2025

[29] [32]

Lookahead anchoring: Preserving character identity in audio-driven human animation.arXiv preprint arXiv:2510.23581, 2025

Junyoung Seo, Rodrigo Mira, Alexandros Haliassos, Stella Bounareli, Honglie Chen, Linh Tran, Seungryong Kim, Zoe Landgraf, and Jie Shen. Lookahead anchoring: Preserving character identity in audio-driven human animation.arXiv preprint arXiv:2510.23581, 2025

work page arXiv 2025

[30] [33]

AnimateAnywhere: Rouse the background in human image animation

Xiaoyu Liu, Mingshuai Yao, Yabo Zhang, Xianhui Lin, Peiran Ren, Xiaoming Li, Ming Liu, and Wangmeng Zuo. AnimateAnywhere: Rouse the background in human image animation. arXiv preprint arXiv:2504.19834, 2025

work page arXiv 2025

[31] [34]

Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram V oleti, Adam Letts, Varun Jampani, and Robin Rombach. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[32] [35]

CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer.arXiv preprint arXiv:2408.06072, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[33] [36]

Wan: Open and Advanced Large-Scale Video Generative Models

Wan Team. Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[34] [37]

Movie Gen: A Cast of Media Foundation Models

Adam Polyak, Amit Zohar, Andrew Brown, Andros Tjandra, Animesh Sinha, Ann Lee, Apoorv Vyas, Bowen Shi, Chih-Yao Ma, Ching-Yao Chuang, et al. Movie gen: A cast of media foundation models.arXiv preprint arXiv:2410.13720, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[35] [38]

MAGI-1: Autoregressive Video Generation at Scale

Sand.ai. MAGI-1: Autoregressive video generation at scale.arXiv preprint arXiv:2505.13211, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[36] [39]

SkyReels-V2: Infinite-length Film Generative Model

Guibin Chen, Dixuan Lin, Jiangping Yang, Chunze Lin, Junchen Zhu, Mingyuan Fan, et al. SkyReels-V2: Infinite-length film generative model.arXiv preprint arXiv:2504.13074, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[37] [41]

URLhttps://arxiv.org/abs/2406.16863

work page arXiv

[38] [42]

Social-mamba: Efficient human trajectory forecasting with state-space models

Po-Chien Luan, Wuyang Li, Yang Gao, and Alexandre Alahi. Social-mamba: Efficient human trajectory forecasting with state-space models. 2025

2025

[39] [43]

Diffusion forcing: Next-token prediction meets full-sequence diffusion

Boyuan Chen, Diego Marti Monso, Yilun Du, Max Simchowitz, Russ Tedrake, and Vincent Sitzmann. Diffusion forcing: Next-token prediction meets full-sequence diffusion. InNeurIPS, 2024

2024

[40] [44]

Freeman, Frédo Durand, Eli Shechtman, and Xun Huang

Tianwei Yin, Qiang Zhang, Richard Zhang, William T. Freeman, Frédo Durand, Eli Shechtman, and Xun Huang. From slow bidirectional to fast autoregressive video diffusion models. In CVPR, 2025

2025

[41] [45]

Self forcing: Bridging the train-test gap in autoregressive video diffusion

Xun Huang, Zhengqi Li, Guande He, Mingyuan Zhou, and Eli Shechtman. Self forcing: Bridging the train-test gap in autoregressive video diffusion. InNeurIPS, 2025

2025

[42] [46]

Rolling forcing: Autoregressive long video diffusion in real time

Kunhao Liu, Wenbo Hu, Jiale Xu, Ying Shan, and Shijian Lu. Rolling forcing: Autoregressive long video diffusion in real time. InICLR, 2026. 12

2026

[43] [47]

LongLive: Real-time Interactive Long Video Generation

Shuai Yang, Wei Huang, Ruihang Chu, Yicheng Xiao, Yuyang Zhao, Xianbang Wang, Muyang Li, Enze Xie, Yingcong Chen, Yao Lu, Song Han, and Yukang Chen. Longlive: Real-time interactive long video generation.arXiv preprint arXiv:2509.22622, 2025. URL https://arxiv.org/abs/2509.22622

work page internal anchor Pith review Pith/arXiv arXiv 2025

[44] [48]

Packing input frame context in next-frame prediction models for video generation

Lvmin Zhang, Shengqu Cai, Muyang Li, Gordon Wetzstein, and Maneesh Agrawala. Packing input frame context in next-frame prediction models for video generation. InNeurIPS, 2025

2025

[45] [49]

Stable video infinity: Infinite-length video generation with error recycling

Wuyang Li, Wentao Pan, Po-Chien Luan, Yang Gao, and Alexandre Alahi. Stable video infinity: Infinite-length video generation with error recycling. InICLR, 2026

2026

[46] [50]

Helios: Real real-time long video generation model.arXiv preprint arXiv:2603.04379, 2026

Shenghai Yuan, Yuanyang Yin, Zongjian Li, Xinwei Huang, Xiao Yang, and Li Yuan. Helios: Real real-time long video generation model.arXiv preprint arXiv:2603.04379, 2026

work page arXiv 2026

[47] [51]

Matrix-Game 3.0: Real-Time and Streaming Interactive World Model with Long-Horizon Memory

Zile Wang, Zexiang Liu, Jaixing Li, Kaichen Huang, Baixin Xu, Fei Kang, Mengyin An, Peiyu Wang, Biao Jiang, Yichen Wei, et al. Matrix-game 3.0: Real-time and streaming interactive world model with long-horizon memory.arXiv preprint arXiv:2604.08995, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[48] [52]

LongCat-Video technical report.arXiv preprint arXiv:2510.22200, 2025

Meituan LongCat Team. LongCat-Video technical report.arXiv preprint arXiv:2510.22200, 2025

work page arXiv 2025

[49] [53]

Malt diffusion: Memory-augmented latent transformers for any-length video generation.arXiv preprint arXiv:2502.12632, 2025

Sihyun Yu, Meera Hahn, Dan Kondratyuk, Jinwoo Shin, Agrim Gupta, Jose Lezama, Irfan Essa, David Ross, and Jonathan Huang. Malt diffusion: Memory-augmented latent transformers for any-length video generation.arXiv preprint arXiv:2502.12632, 2025

work page arXiv 2025

[50] [54]

TinyHistory: Lightweight Video History Embeddings via Two-Stage Context Learning

Lvmin Zhang, Shengqu Cai, Muyang Li, Chong Zeng, Beijia Lu, Anyi Rao, Song Han, Gordon Wetzstein, and Maneesh Agrawala. Pretraining frame preservation in autoregressive video memory compression.arXiv preprint arXiv:2512.23851, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[51] [55]

WorldMem: Long-term consistent world simulation with memory.arXiv preprint arXiv:2504.12369, 2025

Zeqi Xiao, Yushi Lan, Yifan Zhou, Wenqi Ouyang, Shuai Yang, Yanhong Zeng, and Xingang Pan. WorldMem: Long-term consistent world simulation with memory.arXiv preprint arXiv:2504.12369, 2025

work page arXiv 2025

[52] [56]

Context forcing: Consistent autoregressive video generation with long context,

Shuo Chen, Cong Wei, Sun Sun, Ping Nie, Kai Zhou, Ge Zhang, Ming-Hsuan Yang, and Wenhu Chen. Context forcing: Consistent autoregressive video generation with long context.arXiv preprint arXiv:2602.06028, 2026

work page arXiv 2026

[53] [57]

Diffsynth-studio

ModelScope Team. Diffsynth-studio. GitHub repository, 2024. URL https://github.com/ modelscope/DiffSynth-Studio. Accessed: 2026-05-04

2024

[54] [58]

Champ: Controllable and consistent human image animation with 3d parametric guidance

Shenhao Zhu, Junming Leo Chen, Zuozhuo Dai, Zilong Dong, Yinghui Xu, Xun Cao, Yao Yao, Hao Zhu, and Siyu Zhu. Champ: Controllable and consistent human image animation with 3d parametric guidance. InEuropean Conference on Computer Vision, pages 145–162. Springer, 2024

2024

[55] [59]

Dwnet: Dense warp-based network for pose-guided human video generation.arXiv preprint arXiv:1910.09139, 2019

Polina Zablotskaia, Aliaksandr Siarohin, Bo Zhao, and Leonid Sigal. Dwnet: Dense warp-based network for pose-guided human video generation.arXiv preprint arXiv:1910.09139, 2019

work page arXiv 1910

[56] [60]

Seedance 1.0: Exploring the Boundaries of Video Generation Models

Yu Gao, Haoyuan Guo, Tuyen Hoang, Weilin Huang, Lu Jiang, Fangyuan Kong, Huixia Li, Jiashi Li, Liang Li, Xiaojie Li, et al. Seedance 1.0: Exploring the boundaries of video generation models.arXiv preprint arXiv:2506.09113, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[57] [61]

Videomae: Masked autoencoders are data-efficient learners for self-supervised video pre-training

Zhan Tong, Yibing Song, Jue Wang, and Limin Wang. Videomae: Masked autoencoders are data-efficient learners for self-supervised video pre-training. InAdvances in Neural Information Processing Systems, 2022. 13

2022