Lip Forcing: Few-Step Autoregressive Diffusion for Real-time Lip Synchronization

(2) AIPARK); Chulmin Park (2); Heeseong Shin (1); Jinhyuk Jang (1); Joungbin Lee (1); Jung Yi (1); Paul Hyunbin Cho (1); SeokYoung Lee (1); Seungryong Kim (1) ((1) KAIST AI; Siyoon Jin (1)

arxiv: 2606.11180 · v1 · pith:ZS7YHV76new · submitted 2026-06-09 · 💻 cs.CV

Lip Forcing: Few-Step Autoregressive Diffusion for Real-time Lip Synchronization

Paul Hyunbin Cho (1) , Jinhyuk Jang (1) , SeokYoung Lee (1) , Joungbin Lee (1) , Siyoon Jin (1) , Heeseong Shin (1) , Jung Yi (1) , Yunjin Park (2)

show 3 more authors

Chulmin Park (2) Seungryong Kim (1) ((1) KAIST AI (2) AIPARK)

This is my paper

Pith reviewed 2026-06-27 13:38 UTC · model grok-4.3

classification 💻 cs.CV

keywords lip synchronizationautoregressive diffusionmodel distillationreal-time video generationaudio-visual alignmentcausal modelsdiffusion acceleration

0 comments

The pith

Lip Forcing distills a 14B bidirectional diffusion teacher into causal students that generate lip-synced video in two denoising steps without CFG.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that a large bidirectional audio-conditioned video diffusion model for lip synchronization can be distilled into smaller autoregressive causal students capable of real-time inference. It does so by analyzing the teacher's denoising trajectories to identify a CFG-based fidelity-sync tradeoff and then translating that into three specific components for the students. A sympathetic reader would care because full-sequence bidirectional attention and many denoising steps have kept high-quality lip-sync models too slow for streaming or live use. If the distillation holds, the approach makes practical deployment feasible at 31 FPS with sub-millisecond latency to the first frame while preserving reference fidelity and alignment.

Core claim

Lip Forcing distills the 14B teacher into 1.3B and 14B causal students using three components derived from lip-sync-specific teacher-trajectory analysis: Sync-Window DMD, a two-step inference schedule, and a SyncNet-based reward. The students generate each chunk in only two denoising steps without inference-time CFG, enabling real-time performance. The 1.3B student reaches 31 FPS (17.6 times faster than its same-scale bidirectional model) and the 14B student runs 39.8 times faster than its teacher at comparable reference fidelity, with sub-millisecond time-to-first-frame at both scales.

What carries the argument

The lip-sync-specific teacher-trajectory analysis that identifies the CFG fidelity-sync tradeoff and produces the three components (Sync-Window DMD, two-step schedule, SyncNet reward) for distilling bidirectional teachers into causal autoregressive students.

If this is right

The 1.3B student crosses into real-time streaming at 31 FPS.
The 14B student achieves a 39.8 times speedup over its teacher at comparable reference fidelity.
Time-to-first-frame falls below one millisecond at both model scales.
Real-time lip synchronization becomes feasible for streaming applications without inference-time classifier-free guidance.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same distillation pattern could apply to other conditional video-to-video tasks that require causal generation rather than full bidirectional context.
Longer test sequences might expose limits in maintaining temporal consistency that shorter evaluation clips do not reveal.
The sub-millisecond latency opens direct use in interactive settings such as live dubbing or avatar systems where bidirectional models cannot run.

Load-bearing premise

The assumption that the lip-sync-specific trajectory analysis and its three derived components transfer quality from the bidirectional teacher to the causal students without significant degradation in long-sequence coherence or audio-visual alignment.

What would settle it

Measuring audio-visual synchronization error and visual coherence metrics on video sequences substantially longer than the training clips to check whether degradation appears relative to the bidirectional teacher.

Figures

Figures reproduced from arXiv: 2606.11180 by (2) AIPARK), Chulmin Park (2), Heeseong Shin (1), Jinhyuk Jang (1), Joungbin Lee (1), Jung Yi (1), Paul Hyunbin Cho (1), SeokYoung Lee (1), Seungryong Kim (1) ((1) KAIST AI, Siyoon Jin (1), Yunjin Park (2).

**Figure 1.** Figure 1: Lip Forcing. A streaming model for real-time lip synchronization that produces photorealistic, accurately lip-synced video at up to 31 FPS with low latency and memory. Right: both student scales lie on the throughput–FVD Pareto frontier, ahead of prior diffusion lip-sync methods. Abstract Diffusion-based lip synchronization models achieve strong visual quality and audiovisual alignment, but full-sequence… view at source ↗

**Figure 2.** Figure 2: Trajectory analysis of the 14B teacher. Bands are ±1 SE. (a) CFG fidelity–sync tradeoff: CFG (s=4.5, red) improves Sync-C but worsens reference fidelity (LPIPS), while no-CFG (s=1.0, navy) shows the opposite trend. (b) Euler-step 2×2 factorial over schedules (s0, s1), plotted against the second-step landing j1: mixed schedules recover most of the sync gap of the CFG-guided ceiling at landings near step 30.… view at source ↗

**Figure 3.** Figure 3: Why few-step distillation needs trajectory-level care. Two HDTF [52] samples, each showing the 1-step prediction from pure noise, 50-step ODE final output, and ground truth, respectively. Even a one-step prediction preserves coarse facial structure and approximate mouth timing, but it loses the fine articulation and audio-visual synchronization recovered by the full 50-step teacher. Lip Forcing compresses… view at source ↗

**Figure 4.** Figure 4: Fixed-CFG endpoints vs. diagnostic operating point (green diamond, at ODE step j = 30). n=10, ±1 SE. SSIM and 4-metric in App. C.2. The strong one-step prediction suggests that the teacher does not require dense trajectory traversal for coarse lip timing and structure, but the remaining detail gap motivates asking where a second denoising step should land. We therefore conduct an Euler-step analysis to … view at source ↗

**Figure 5.** Figure 5: Architecture of Lip Forcing. The causal student denoises Gaussian noise with lip-sync conditions, producing a chunk-wise causal rollout via the two-step schedule (Sec. 4.4). The clean prediction xˆ0 is supervised by the DMD [48, 47] gradient (Eq. 4) between a frozen 14B teacher and a trainable fake-score critic, with the teacher’s CFG gated by the windowed schedule sSW of Eq. 6. The same xˆ0 is decoded by … view at source ↗

**Figure 6.** Figure 6: Qualitative comparison on HDTF. Each row shows the same source frame rendered by our method, six lip-sync baselines, and the ground truth (GT) at the moment of articulating the bracketed English phoneme. Best viewed zoomed in and in color. Qualitative comparisons against all baselines at matched phoneme-articulation moments are shown in [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗

**Figure 7.** Figure 7: The CFG fidelity–sync tradeoff (full 4-metric). Per-step mean across n=10 samples; shaded bands are ±1 standard error. Red: CFG-guided teacher (s=4.5); navy: no-CFG teacher (s=1.0). SSIM (mouth) tracks LPIPS, and Sync-D mirrors Sync-C: the same separation between the two trajectories observed in the main figure is reproduced on these additional metrics. 0 20 40 Target ODE step 0.850 0.875 0.900 0.925 SSIM … view at source ↗

**Figure 8.** Figure 8: Euler-step 2 × 2 factorial (full 4-metric). Per-step mean across n=10 samples; shaded bands are ±1 standard error. Each trace is one cell of (s0, s1). The reference-axis pattern (cells sharing s0 converge by mid-trajectory) holds on SSIM as well as LPIPS; the sync-axis pattern (single-CFG cells close most of the gap to CFG→CFG around step 30, then diverge outside the mid-trajectory window) holds on Sync-D … view at source ↗

**Figure 9.** Figure 9: Fixed-CFG endpoints vs. schedule operating point (full 4-panel). Step-49 endpoints of fixed-CFG sweeps at s ∈ {1.0, 3.0, 4.5, 6.0} as open circles; the no-CFG→CFG Euler-step operating point at j=30 as a filled green diamond. Both axes carry ±1 SE error bars on n=10 samples. Axes are oriented so up-right is favorable (LPIPS, Sync-D inverted). The Sync-D panels (the right column) tell the same story as the S… view at source ↗

**Figure 10.** Figure 10: CFG fidelity–sync tradeoff, audio-only drop mode. Audio-only counterpart of main [PITH_FULL_IMAGE:figures/full_fig_p019_10.png] view at source ↗

**Figure 11.** Figure 11: Euler-step CFG factorial, audio-only drop mode. Audio-only counterpart of main [PITH_FULL_IMAGE:figures/full_fig_p020_11.png] view at source ↗

**Figure 12.** Figure 12: Trajectory plateau zoom around the joint reference-sync optimum. Per-step means on the mouth region across n=10 samples; shaded bands are ±1 standard error. Same four Eulerstep cells (s0, s1) as main [PITH_FULL_IMAGE:figures/full_fig_p020_12.png] view at source ↗

**Figure 13.** Figure 13: Streaming attention sink and dynamic RoPE. Cache state across two consecutive chunks under our streaming setup: 1-frame sink plus a 6-frame rolling window comprising one cached past block of 3 frames and the current 3-frame chunk being denoised, for a total cache size of 7 frames. Boxes are colored by region (orange = sink, blue = cached past block, green = current chunk being denoised); numbers inside bo… view at source ↗

**Figure 14.** Figure 14: Throughput–FVD Pareto frontier across all baselines on HDTF. Companion to the diffusion-only chart in the main paper ( [PITH_FULL_IMAGE:figures/full_fig_p024_14.png] view at source ↗

**Figure 15.** Figure 15: Long-video qualitative results on HDTFlong. Two identities, each rolled out to t=180 s and sampled every 30 s, comparing ground truth, Lip Forcing, and the strongest baseline X-Dub at consistent timestamps. Frame quality, identity, and background remain stable across the full 3-minute rollout under Lip Forcing’s causal AR streaming, well beyond the 81-frame (∼3.24 s) training chunk. E.5 Cross-identity eva… view at source ↗

**Figure 16.** Figure 16: Cross-identity qualitative results on HDTF. Two source clips are driven by audio from a different speaker (top row, Audio Source); columns mark the moments at which the highlighted English phoneme is articulated. Each column compares Wav2Lip, VideoReTalking, Diff2Lip, X-Dub, MuseTalk, LatentSync, and Lip Forcing against the same source frame. Lip motion in Lip Forcing follows the driving audio rather than… view at source ↗

**Figure 17.** Figure 17: Additional qualitative results from the Hallo3 and HDTF test sets. [PITH_FULL_IMAGE:figures/full_fig_p029_17.png] view at source ↗

**Figure 18.** Figure 18: Additional qualitative results from the TalkVid test set. [PITH_FULL_IMAGE:figures/full_fig_p030_18.png] view at source ↗

read the original abstract

Diffusion-based lip synchronization models achieve strong visual quality and audio-visual alignment, but full-sequence bidirectional attention and many denoising steps make them impractical for real-time inference. We present Lip Forcing, to our knowledge the first autoregressive diffusion method for video-to-video (V2V) lip synchronization, which distills a 14B audio-conditioned bidirectional video diffusion teacher into causal students. At inference, the students generate each chunk in only two denoising steps without inference-time CFG, enabling real-time lip synchronization. A lip-sync-specific teacher-trajectory analysis reveals a CFG fidelity-sync tradeoff: no-CFG predictions favor reference fidelity, whereas CFG-guided predictions favor synchronization within a mid-trajectory band. Lip Forcing translates this finding into three analysis-derived components: Sync-Window DMD, a two-step inference schedule, and a SyncNet-based reward. We validate Lip Forcing at two student scales, both distilled from the 14B teacher. The 1.3B student crosses into real-time streaming at 31 FPS, $17.6\times$ faster than its same-scale bidirectional model. The 14B student, the largest diffusion model reported for V2V lip synchronization, runs $39.8\times$ faster than its teacher at comparable reference fidelity. Time-to-first-frame is sub-millisecond at both scales, far below every diffusion baseline.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Lip Forcing shows a practical distillation route from bidirectional diffusion to two-step autoregressive lip sync with real speed gains, but the abstract leaves long-horizon coherence and evaluation details unaddressed.

read the letter

The core advance is distilling a 14B bidirectional teacher into causal students that generate lip-sync video chunks in two denoising steps without CFG. This yields the reported 31 FPS on the 1.3B model (17.6× faster than same-scale bidirectional) and 39.8× speedup on the 14B student, plus sub-millisecond time-to-first-frame. The teacher-trajectory analysis that produced Sync-Window DMD, the two-step schedule, and the SyncNet reward is the part that feels new and directly tied to the CFG fidelity-sync tradeoff.

The paper does a clean job turning that analysis into concrete inference components and showing the speed crossover into real-time streaming. Those numbers matter for anyone who needs live video output rather than offline generation.

The soft spots sit where the abstract stops. No long-sequence coherence metrics or drift ablations appear, so the claim that quality transfers without cumulative misalignment over extended clips rests on the assumption that local SyncNet rewards plus the derived schedule are enough. The evaluation description is also thin: no explicit metrics, baselines, dataset splits, or variance numbers are given, which makes the "comparable reference fidelity" statement hard to weigh. The circularity risk from distilling against a SyncNet reward trained on similar distributions is noted but not quantified.

This is for groups already working on diffusion video models who want to move them toward streaming. A reader who needs the method details and the missing long-horizon checks will get value once the full results are in. The work deserves a serious referee because the distillation idea is concrete and the speed target is practically relevant, even if the current evidence is still conditional on the full paper filling in the gaps.

Referee Report

2 major / 1 minor

Summary. The paper introduces Lip Forcing, the first autoregressive diffusion method for video-to-video lip synchronization. It distills a 14B bidirectional audio-conditioned video diffusion teacher into causal student models (1.3B and 14B scales) that generate each chunk in only two denoising steps without inference-time CFG. The method is derived from a lip-sync-specific teacher-trajectory analysis identifying a CFG fidelity-sync tradeoff, yielding three components (Sync-Window DMD, two-step inference schedule, SyncNet-based reward). Reported results include the 1.3B student reaching 31 FPS (17.6× faster than same-scale bidirectional) and the 14B student running 39.8× faster than its teacher at comparable reference fidelity, with sub-millisecond time-to-first-frame.

Significance. If the quality transfer holds, the work would enable practical real-time diffusion-based V2V lip synchronization by overcoming bidirectional attention and multi-step denoising costs. The trajectory analysis and distillation components offer a targeted approach to balancing fidelity and synchronization in few-step causal generation, with potential impact on streaming video applications.

major comments (2)

[Abstract] Abstract: The headline performance claims (31 FPS, 17.6× and 39.8× speedups at comparable reference fidelity) rest on unspecified validation; no details are provided on the metrics or procedures used to measure reference fidelity and synchronization, baseline comparisons, error bars, dataset splits, or long-sequence evaluation.
[Abstract] Abstract: The central assumption that the CFG fidelity-sync tradeoff analysis and its three derived components (Sync-Window DMD, two-step schedule, SyncNet reward) successfully transfer bidirectional teacher quality to causal students without degradation in long-sequence coherence or audio-visual alignment is load-bearing for the speedups, yet no quantitative long-horizon metrics or ablations on coherence drift over sequences longer than training chunks are reported.

minor comments (1)

[Abstract] Abstract: The claim of being 'to our knowledge the first autoregressive diffusion method for V2V lip synchronization' would benefit from explicit citations to related autoregressive or few-step diffusion works for context.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the review and the identification of areas where the abstract could better contextualize our claims. We respond to each major comment below with references to the manuscript content and indicate planned revisions where appropriate.

read point-by-point responses

Referee: [Abstract] Abstract: The headline performance claims (31 FPS, 17.6× and 39.8× speedups at comparable reference fidelity) rest on unspecified validation; no details are provided on the metrics or procedures used to measure reference fidelity and synchronization, baseline comparisons, error bars, dataset splits, or long-sequence evaluation.

Authors: The abstract is a high-level summary; full specification of metrics (SyncNet confidence for synchronization, FID/LPIPS for reference fidelity), evaluation procedures, baseline models, dataset splits, and error bars from repeated runs appear in Sections 4.1–4.2 and the supplementary material. We will revise the abstract to include a short clause directing readers to these sections for the validation protocol. revision: yes
Referee: [Abstract] Abstract: The central assumption that the CFG fidelity-sync tradeoff analysis and its three derived components (Sync-Window DMD, two-step schedule, SyncNet reward) successfully transfer bidirectional teacher quality to causal students without degradation in long-sequence coherence or audio-visual alignment is load-bearing for the speedups, yet no quantitative long-horizon metrics or ablations on coherence drift over sequences longer than training chunks are reported.

Authors: Section 3.1 presents the teacher-trajectory analysis that motivates the three components, and Section 4 reports that the distilled students achieve comparable reference fidelity to the teacher at the evaluated chunk lengths. We agree that explicit quantitative ablations measuring coherence drift on sequences substantially longer than the training chunks are not included in the manuscript. revision: partial

standing simulated objections not resolved

No quantitative long-horizon metrics or ablations on coherence drift over sequences longer than training chunks are available in the current work.

Circularity Check

0 steps flagged

Derivation chain is self-contained; no load-bearing step reduces to input by construction

full rationale

The paper's central chain—teacher-trajectory analysis of CFG fidelity-sync tradeoff yielding Sync-Window DMD, two-step schedule, and SyncNet reward, followed by distillation into causal students—relies on an external 14B bidirectional teacher and empirical analysis rather than self-definition or fitted-parameter renaming. No equation or component is shown to equal its input by construction, and SyncNet appears as an external reward model. Claims of speed and fidelity are presented as experimental outcomes, not definitional. This is the common non-circular case for distillation papers.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The approach rests on the existence of a high-quality 14B bidirectional teacher model and the validity of distilling its behavior into causal students via the three analysis-derived components. No explicit free parameters or invented entities are named in the abstract.

axioms (2)

domain assumption A 14B audio-conditioned bidirectional video diffusion model exists and produces high-quality lip synchronization outputs suitable for distillation.
The entire pipeline begins with distilling this teacher; its quality and availability are presupposed.
domain assumption The CFG fidelity-sync tradeoff observed in the teacher trajectory generalizes to the student models and can be exploited via Sync-Window DMD and SyncNet reward.
This analysis is presented as the source of the three key components.

pith-pipeline@v0.9.1-grok · 5841 in / 1535 out tokens · 16139 ms · 2026-06-27T13:38:27.492943+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

52 extracted references · 36 canonical work pages · 17 internal anchors

[1]

wav2vec 2.0: A framework for self-supervised learning of speech representations,

Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, and Michael Auli. wav2vec 2.0: A framework for self-supervised learning of speech representations, 2020. URL https://arxiv.org/abs/2006.11477

work page arXiv 2020
[2]

Taehv: Tiny autoencoder for hunyuan video

Ollin Boer Bohan. Taehv: Tiny autoencoder for hunyuan video. https://github.com/madebyollin/ taehv, 2025

2025
[3]

Diffusion forcing: Next-token prediction meets full-sequence diffusion.Advances in Neural Information Processing Systems, 37:24081–24125, 2024

Boyuan Chen, Diego Martí Monsó, Yilun Du, Max Simchowitz, Russ Tedrake, and Vincent Sitzmann. Diffusion forcing: Next-token prediction meets full-sequence diffusion.Advances in Neural Information Processing Systems, 37:24081–24125, 2024

2024
[4]

Talkvid: A large-scale diversified dataset for audio-driven talking head synthesis, 2025

Shunian Chen, Hejin Huang, Yexin Liu, Zihan Ye, Pengcheng Chen, Chenghao Zhu, Michael Guan, Rongsheng Wang, Junying Chen, Guanbin Li, Ser-Nam Lim, Harry Yang, and Benyou Wang. Talkvid: A large-scale diversified dataset for audio-driven talking head synthesis, 2025. URL https://arxiv.org/ abs/2508.13618

work page arXiv 2025
[5]

VideoReTalking: Audio-based lip synchronization for talking head video editing in the wild, 2022

Kun Cheng, Xiaodong Cun, Yong Zhang, Menghan Xia, Fei Yin, Mingrui Zhu, Xuan Wang, Jue Wang, and Nannan Wang. VideoReTalking: Audio-based lip synchronization for talking head video editing in the wild, 2022. URLhttps://arxiv.org/abs/2211.14758

work page arXiv 2022
[6]

Out of time: automated lip sync in the wild

Joon Son Chung and Andrew Zisserman. Out of time: automated lip sync in the wild. InAsian conference on computer vision, pages 251–263. Springer, 2016

2016
[7]

V oxceleb2: Deep speaker recognition

Joon Son Chung, Arsha Nagrani, and Andrew Zisserman. V oxceleb2: Deep speaker recognition. In Interspeech 2018, page 1086–1090. ISCA, September 2018. doi: 10.21437/interspeech.2018-1929. URL http://dx.doi.org/10.21437/Interspeech.2018-1929

work page doi:10.21437/interspeech.2018-1929 2018
[8]

Hallo3: Highly dynamic and realistic portrait image animation with video diffusion transformer, 2025

Jiahao Cui, Hui Li, Yun Zhan, Hanlin Shang, Kaihui Cheng, Yuqi Ma, Shan Mu, Hang Zhou, Jingdong Wang, and Siyu Zhu. Hallo3: Highly dynamic and realistic portrait image animation with video diffusion transformer, 2025. URLhttps://arxiv.org/abs/2412.00733

work page arXiv 2025
[9]

Arcface: Additive angular margin loss for deep face recognition

Jiankang Deng, Jia Guo, Niannan Xue, and Stefanos Zafeiriou. Arcface: Additive angular margin loss for deep face recognition. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4690–4699, 2019

2019
[10]

Scaling Rectified Flow Transformers for High-Resolution Image Synthesis

Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, Dustin Podell, Tim Dockhorn, Zion English, Kyle Lacey, Alex Goodwin, Yannik Marek, and Robin Rombach. Scaling rectified flow transformers for high-resolution image synthesis, 2024. URLhttps://arxiv.org/abs/2403.03206

work page internal anchor Pith review Pith/arXiv arXiv 2024
[11]

Omniavatar: Efficient audio-driven avatar video generation with adaptive body animation.arXiv preprint arXiv:2506.18866, 2025

Qijun Gan, Ruizi Yang, Jianke Zhu, Shaofei Xue, and Steven Hoi. Omniavatar: Efficient audio-driven avatar video generation with adaptive body animation.arXiv preprint arXiv:2506.18866, 2025. 10

work page arXiv 2025
[12]

Generative Adversarial Networks

Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks, 2014. URL https://arxiv.org/abs/ 1406.2661

work page internal anchor Pith review Pith/arXiv arXiv 2014
[13]

Accelerate: Training and inference at scale made simple, efficient and adaptable.https://github.com/huggingface/accelerate, 2022

Sylvain Gugger, Lysandre Debut, Thomas Wolf, Philipp Schmid, Zachary Mueller, Sourab Mangrulkar, Marc Sun, and Benjamin Bossan. Accelerate: Training and inference at scale made simple, efficient and adaptable.https://github.com/huggingface/accelerate, 2022

2022
[14]

From inpainting to editing: A self-bootstrapping framework for context-rich visual dubbing.arXiv preprint arXiv:2512.25066, 2025

Xu He, Haoxian Zhang, Hejia Chen, Changyuan Zheng, Liyang Chen, Songlin Tang, Jiehui Huang, Xiaoqiang Liu, Pengfei Wan, and Zhiyong Wu. From inpainting to editing: A self-bootstrapping framework for context-rich visual dubbing.arXiv preprint arXiv:2512.25066, 2025

work page arXiv 2025
[15]

GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium

Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium, 2018. URL https: //arxiv.org/abs/1706.08500

work page internal anchor Pith review Pith/arXiv arXiv 2018
[16]

Denoising diffusion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020

2020
[17]

LoRA: Low-rank adaptation of large language models

Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. InInternational Conference on Learning Representations, 2022. URLhttps://openreview.net/forum?id=nZeVKeeFYf9

2022
[18]

Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion

Xun Huang, Zhengqi Li, Guande He, Mingyuan Zhou, and Eli Shechtman. Self Forcing: Bridging the train-test gap in autoregressive video diffusion.arXiv preprint arXiv:2506.08009, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[19]

Live Avatar: Streaming Real-time Audio-Driven Avatar Generation with Infinite Length

Yubo Huang, Hailong Guo, Fangtai Wu, Shifeng Zhang, Shijie Huang, Qijun Gan, Lin Liu, Sirui Zhao, Enhong Chen, Jiaming Liu, and Steven Hoi. Live avatar: Streaming real-time audio-driven avatar generation with infinite length, 2025. URLhttps://arxiv.org/abs/2512.04677

work page internal anchor Pith review Pith/arXiv arXiv 2025
[20]

MATRIX: Mask Track Alignment for Interaction-aware Video Generation

Siyoon Jin, Seongchan Kim, Dahyun Chung, Jaeho Lee, Hyunwook Choi, Jisu Nam, Jiyoung Kim, and Seungryong Kim. Matrix: Mask track alignment for interaction-aware video generation, 2025. URL https://arxiv.org/abs/2510.07310

work page internal anchor Pith review Pith/arXiv arXiv 2025
[21]

Moditalker: Motion-disentangled diffusion model for high-fidelity talking head generation, 2024

Seyeon Kim, Siyoon Jin, Jihye Park, Kihong Kim, Jiyoung Kim, Jisu Nam, and Seungryong Kim. Moditalker: Motion-disentangled diffusion model for high-fidelity talking head generation, 2024. URL https://arxiv.org/abs/2403.19144

work page arXiv 2024
[22]

HunyuanVideo: A Systematic Framework For Large Video Generative Models

Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. Hunyuanvideo: A systematic framework for large video generative models.arXiv preprint arXiv:2412.03603, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[23]

V-warper: Appearance-consistent video diffusion personalization via value warping, 2025

Hyunkoo Lee, Wooseok Jang, Jini Yang, Taehwan Kim, Sangoh Kim, Sangwon Jung, and Seungryong Kim. V-warper: Appearance-consistent video diffusion personalization via value warping, 2025. URL https://arxiv.org/abs/2512.12375

work page arXiv 2025
[24]

3d scene prompting for scene-consistent camera-controllable video generation, 2025

JoungBin Lee, Jaewoo Jung, Jisang Han, Takuya Narihira, Kazumi Fukuda, Junyoung Seo, Sunghwan Hong, Yuki Mitsufuji, and Seungryong Kim. 3d scene prompting for scene-consistent camera-controllable video generation, 2025. URLhttps://arxiv.org/abs/2510.14945

work page arXiv 2025
[25]

LatentSync: Taming audio-conditioned latent diffusion models for lip sync with SyncNet supervision.arXiv preprint arXiv:2412.09262, 2024

Chunyu Li, Chao Zhang, Weikai Xu, Jingyu Lin, Jinghui Xie, Weiguo Feng, Bingyue Peng, Cunjian Chen, and Weiwei Xing. LatentSync: Taming audio-conditioned latent diffusion models for lip sync with SyncNet supervision.arXiv preprint arXiv:2412.09262, 2024

work page arXiv 2024
[26]

Rolling Forcing: Autoregressive Long Video Diffusion in Real Time

Kunhao Liu, Wenbo Hu, Jiale Xu, Ying Shan, and Shijian Lu. Rolling forcing: Autoregressive long video diffusion in real time, 2025. URLhttps://arxiv.org/abs/2509.25161

work page internal anchor Pith review Pith/arXiv arXiv 2025
[27]

Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow

Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow.arXiv preprint arXiv:2209.03003, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[28]

Decoupled Weight Decay Regularization

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization.arXiv preprint arXiv:1711.05101, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[29]

Reward Forcing: Efficient Streaming Video Generation with Rewarded Distribution Matching Distillation

Yunhong Lu, Yanhong Zeng, Haobo Li, Hao Ouyang, Qiuyu Wang, Ka Leong Cheng, Jiapeng Zhu, Hengyuan Cao, Zhipeng Zhang, Xing Zhu, et al. Reward forcing: Efficient streaming video generation with rewarded distribution matching distillation.arXiv preprint arXiv:2512.04678, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[30]

SayAnything: Audio-driven lip synchronization with conditional video diffusion, 2025

Junxian Ma, Shiwen Wang, Jian Yang, Junyi Hu, Jian Liang, Guosheng Lin, Jingbo Chen, Kai Li, and Yu Meng. SayAnything: Audio-driven lip synchronization with conditional video diffusion, 2025. URL https://arxiv.org/abs/2502.11515. 11

work page arXiv 2025
[31]

Diff2lip: Audio conditioned diffusion models for lip-synchronization, 2023

Soumik Mukhopadhyay, Saksham Suri, Ravi Teja Gadde, and Abhinav Shrivastava. Diff2lip: Audio conditioned diffusion models for lip-synchronization, 2023. URL https://arxiv.org/abs/2308. 09716

2023
[32]

Omnisync: Towards universal lip synchronization via diffusion transformers, 2025

Ziqiao Peng, Jiwen Liu, Haoxian Zhang, Xiaoqiang Liu, Songlin Tang, Pengfei Wan, Di Zhang, Hongyan Liu, and Jun He. Omnisync: Towards universal lip synchronization via diffusion transformers, 2025. URL https://arxiv.org/abs/2505.21448

work page arXiv 2025
[33]

In: Proceedings of the 28th ACM International Conference on Multimedia

K R Prajwal, Rudrabha Mukhopadhyay, Vinay P. Namboodiri, and C.V . Jawahar. A lip sync expert is all you need for speech to lip generation in the wild. InProceedings of the 28th ACM International Conference on Multimedia, MM ’20, page 484–492. ACM, October 2020. doi: 10.1145/3394171.3413532. URL http://dx.doi.org/10.1145/3394171.3413532

work page doi:10.1145/3394171.3413532 2020
[34]

High-resolution image synthesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022

2022
[35]

Sand.ai, Hansi Teng, Hongyu Jia, Lei Sun, Lingzhi Li, Maolin Li, Mingqiu Tang, Shuai Han, Tianning Zhang, W. Q. Zhang, Weifeng Luo, Xiaoyang Kang, Yuchen Sun, Yue Cao, Yunpeng Huang, Yutong Lin, Yuxin Fang, Zewei Tao, Zheng Zhang, Zhongshu Wang, Zixun Liu, Dai Shi, Guoli Su, Hanwen Sun, Hong Pan, Jie Wang, Jiexin Sheng, Min Cui, Min Hu, Ming Yan, Shucheng...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[36]

Motionstream: Real-time video generation with interactive motion controls.arXiv preprint arXiv:2511.01266, 2025

Joonghyuk Shin, Zhengqi Li, Richard Zhang, Jun-Yan Zhu, Jaesik Park, Eli Shechtman, and Xun Huang. Motionstream: Real-time video generation with interactive motion controls, 2025. URL https://arxiv. org/abs/2511.01266

work page arXiv 2025
[37]

Denoising Diffusion Implicit Models

Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models, 2022. URL https://arxiv.org/abs/2010.02502

work page internal anchor Pith review Pith/arXiv arXiv 2022
[38]

Blindly assess image quality in the wild guided by a self-adaptive hyper network

Shaolin Su, Qingsen Yan, Yu Zhu, Cheng Zhang, Xin Ge, Jinqiu Sun, and Yanning Zhang. Blindly assess image quality in the wild guided by a self-adaptive hyper network. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3667–3676, 2020

2020
[39]

Towards Accurate Generative Models of Video: A New Metric & Challenges

Thomas Unterthiner, Sjoerd van Steenkiste, Karol Kurach, Raphael Marinier, Marcin Michalski, and Sylvain Gelly. Towards accurate generative models of video: A new metric & challenges, 2019. URL https://arxiv.org/abs/1812.01717

work page internal anchor Pith review Pith/arXiv arXiv 2019
[40]

Wan: Open and Advanced Large-Scale Video Generative Models

Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, Jianyuan Zeng, Jiayu Wang, Jingfeng Zhang, Jingren Zhou, Jinkai Wang, Jixuan Chen, Kai Zhu, Kang Zhao, Keyu Yan, Lianghua Huang, Mengyang Feng, Ningyi Zhang, Pandeng Li, Pingyu Wu, Ruihang Chu, Ruili Feng, Shiwei Zhang, Siyang Sun, Tao Fang, T...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[41]

Tan, and Haizhou Li

Jiadong Wang, Xinyuan Qian, Malu Zhang, Robby T. Tan, and Haizhou Li. Seeing what you said: Talking face generation guided by a lip reading expert, 2023. URLhttps://arxiv.org/abs/2303.17480

work page arXiv 2023
[42]

Fantasytalking: Realistic talking portrait generation via coherent motion synthesis, 2025

Mengchao Wang, Qiang Wang, Fan Jiang, Yaqi Fan, Yunpeng Zhang, Yonggang Qi, Kun Zhao, and Mu Xu. Fantasytalking: Realistic talking portrait generation via coherent motion synthesis, 2025. URL https://arxiv.org/abs/2504.04842

work page arXiv 2025
[43]

Image quality assessment: from error visibility to structural similarity.IEEE transactions on image processing, 13(4):600–612, 2004

Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment: from error visibility to structural similarity.IEEE transactions on image processing, 13(4):600–612, 2004

2004
[44]

LongLive: Real-time Interactive Long Video Generation

Shuai Yang, Wei Huang, Ruihang Chu, Yicheng Xiao, Yuyang Zhao, Xianbang Wang, Muyang Li, Enze Xie, Yingcong Chen, Yao Lu, Song Han, and Yukang Chen. LongLive: Real-time interactive long video generation, 2025. URLhttps://arxiv.org/abs/2509.22622

work page internal anchor Pith review Pith/arXiv arXiv 2025
[45]

CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. CogVideoX: Text-to-video diffusion models with an expert transformer.arXiv preprint arXiv:2408.06072, 2024. 12

work page internal anchor Pith review Pith/arXiv arXiv 2024
[46]

Deep forc- ing: Training-free long video generation with deep sink and participative compression.arXiv preprint arXiv:2512.05081, 2025

Jung Yi, Wooseok Jang, Paul Hyunbin Cho, Jisu Nam, Heeji Yoon, and Seungryong Kim. Deep forc- ing: Training-free long video generation with deep sink and participative compression.arXiv preprint arXiv:2512.05081, 2025

work page arXiv 2025
[47]

Improved distribution matching distillation for fast image synthesis.Advances in neural information processing systems, 37:47455–47487, 2024

Tianwei Yin, Michaël Gharbi, Taesung Park, Richard Zhang, Eli Shechtman, Fredo Durand, and William T Freeman. Improved distribution matching distillation for fast image synthesis.Advances in neural information processing systems, 37:47455–47487, 2024

2024
[48]

One-step diffusion with distribution matching distillation

Tianwei Yin, Michaël Gharbi, Richard Zhang, Eli Shechtman, Fredo Durand, William T Freeman, and Taesung Park. One-step diffusion with distribution matching distillation. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6613–6623, 2024

2024
[49]

From slow bidirectional to fast autoregressive video diffusion models

Tianwei Yin, Qiang Zhang, Richard Zhang, William T Freeman, Fredo Durand, Eli Shechtman, and Xun Huang. From slow bidirectional to fast autoregressive video diffusion models. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 22963–22974, 2025

2025
[50]

The unreasonable effectiveness of deep features as a perceptual metric

Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 586–595, 2018

2018
[51]

MuseTalk: Real-time high-fidelity video dubbing via spatio-temporal sampling, 2025

Yue Zhang, Zhizhou Zhong, Minhao Liu, Zhaokang Chen, Bin Wu, Yubin Zeng, Chao Zhan, Yingjie He, Junxin Huang, and Wenjiang Zhou. MuseTalk: Real-time high-fidelity video dubbing via spatio-temporal sampling, 2025. URLhttps://arxiv.org/abs/2410.10122

work page arXiv 2025
[52]

Flow-guided one-shot talking face generation with a high-resolution audio-visual dataset

Zhimeng Zhang, Lincheng Li, Yu Ding, and Changjie Fan. Flow-guided one-shot talking face generation with a high-resolution audio-visual dataset. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3661–3670, 2021. 13 Appendix A Index of supplementary material This appendix is organized into seven sections. Section B ...

2021

[1] [1]

wav2vec 2.0: A framework for self-supervised learning of speech representations,

Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, and Michael Auli. wav2vec 2.0: A framework for self-supervised learning of speech representations, 2020. URL https://arxiv.org/abs/2006.11477

work page arXiv 2020

[2] [2]

Taehv: Tiny autoencoder for hunyuan video

Ollin Boer Bohan. Taehv: Tiny autoencoder for hunyuan video. https://github.com/madebyollin/ taehv, 2025

2025

[3] [3]

Diffusion forcing: Next-token prediction meets full-sequence diffusion.Advances in Neural Information Processing Systems, 37:24081–24125, 2024

Boyuan Chen, Diego Martí Monsó, Yilun Du, Max Simchowitz, Russ Tedrake, and Vincent Sitzmann. Diffusion forcing: Next-token prediction meets full-sequence diffusion.Advances in Neural Information Processing Systems, 37:24081–24125, 2024

2024

[4] [4]

Talkvid: A large-scale diversified dataset for audio-driven talking head synthesis, 2025

Shunian Chen, Hejin Huang, Yexin Liu, Zihan Ye, Pengcheng Chen, Chenghao Zhu, Michael Guan, Rongsheng Wang, Junying Chen, Guanbin Li, Ser-Nam Lim, Harry Yang, and Benyou Wang. Talkvid: A large-scale diversified dataset for audio-driven talking head synthesis, 2025. URL https://arxiv.org/ abs/2508.13618

work page arXiv 2025

[5] [5]

VideoReTalking: Audio-based lip synchronization for talking head video editing in the wild, 2022

Kun Cheng, Xiaodong Cun, Yong Zhang, Menghan Xia, Fei Yin, Mingrui Zhu, Xuan Wang, Jue Wang, and Nannan Wang. VideoReTalking: Audio-based lip synchronization for talking head video editing in the wild, 2022. URLhttps://arxiv.org/abs/2211.14758

work page arXiv 2022

[6] [6]

Out of time: automated lip sync in the wild

Joon Son Chung and Andrew Zisserman. Out of time: automated lip sync in the wild. InAsian conference on computer vision, pages 251–263. Springer, 2016

2016

[7] [7]

V oxceleb2: Deep speaker recognition

Joon Son Chung, Arsha Nagrani, and Andrew Zisserman. V oxceleb2: Deep speaker recognition. In Interspeech 2018, page 1086–1090. ISCA, September 2018. doi: 10.21437/interspeech.2018-1929. URL http://dx.doi.org/10.21437/Interspeech.2018-1929

work page doi:10.21437/interspeech.2018-1929 2018

[8] [8]

Hallo3: Highly dynamic and realistic portrait image animation with video diffusion transformer, 2025

Jiahao Cui, Hui Li, Yun Zhan, Hanlin Shang, Kaihui Cheng, Yuqi Ma, Shan Mu, Hang Zhou, Jingdong Wang, and Siyu Zhu. Hallo3: Highly dynamic and realistic portrait image animation with video diffusion transformer, 2025. URLhttps://arxiv.org/abs/2412.00733

work page arXiv 2025

[9] [9]

Arcface: Additive angular margin loss for deep face recognition

Jiankang Deng, Jia Guo, Niannan Xue, and Stefanos Zafeiriou. Arcface: Additive angular margin loss for deep face recognition. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4690–4699, 2019

2019

[10] [10]

Scaling Rectified Flow Transformers for High-Resolution Image Synthesis

Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, Dustin Podell, Tim Dockhorn, Zion English, Kyle Lacey, Alex Goodwin, Yannik Marek, and Robin Rombach. Scaling rectified flow transformers for high-resolution image synthesis, 2024. URLhttps://arxiv.org/abs/2403.03206

work page internal anchor Pith review Pith/arXiv arXiv 2024

[11] [11]

Omniavatar: Efficient audio-driven avatar video generation with adaptive body animation.arXiv preprint arXiv:2506.18866, 2025

Qijun Gan, Ruizi Yang, Jianke Zhu, Shaofei Xue, and Steven Hoi. Omniavatar: Efficient audio-driven avatar video generation with adaptive body animation.arXiv preprint arXiv:2506.18866, 2025. 10

work page arXiv 2025

[12] [12]

Generative Adversarial Networks

Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks, 2014. URL https://arxiv.org/abs/ 1406.2661

work page internal anchor Pith review Pith/arXiv arXiv 2014

[13] [13]

Accelerate: Training and inference at scale made simple, efficient and adaptable.https://github.com/huggingface/accelerate, 2022

Sylvain Gugger, Lysandre Debut, Thomas Wolf, Philipp Schmid, Zachary Mueller, Sourab Mangrulkar, Marc Sun, and Benjamin Bossan. Accelerate: Training and inference at scale made simple, efficient and adaptable.https://github.com/huggingface/accelerate, 2022

2022

[14] [14]

From inpainting to editing: A self-bootstrapping framework for context-rich visual dubbing.arXiv preprint arXiv:2512.25066, 2025

Xu He, Haoxian Zhang, Hejia Chen, Changyuan Zheng, Liyang Chen, Songlin Tang, Jiehui Huang, Xiaoqiang Liu, Pengfei Wan, and Zhiyong Wu. From inpainting to editing: A self-bootstrapping framework for context-rich visual dubbing.arXiv preprint arXiv:2512.25066, 2025

work page arXiv 2025

[15] [15]

GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium

Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium, 2018. URL https: //arxiv.org/abs/1706.08500

work page internal anchor Pith review Pith/arXiv arXiv 2018

[16] [16]

Denoising diffusion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020

2020

[17] [17]

LoRA: Low-rank adaptation of large language models

Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. InInternational Conference on Learning Representations, 2022. URLhttps://openreview.net/forum?id=nZeVKeeFYf9

2022

[18] [18]

Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion

Xun Huang, Zhengqi Li, Guande He, Mingyuan Zhou, and Eli Shechtman. Self Forcing: Bridging the train-test gap in autoregressive video diffusion.arXiv preprint arXiv:2506.08009, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[19] [19]

Live Avatar: Streaming Real-time Audio-Driven Avatar Generation with Infinite Length

Yubo Huang, Hailong Guo, Fangtai Wu, Shifeng Zhang, Shijie Huang, Qijun Gan, Lin Liu, Sirui Zhao, Enhong Chen, Jiaming Liu, and Steven Hoi. Live avatar: Streaming real-time audio-driven avatar generation with infinite length, 2025. URLhttps://arxiv.org/abs/2512.04677

work page internal anchor Pith review Pith/arXiv arXiv 2025

[20] [20]

MATRIX: Mask Track Alignment for Interaction-aware Video Generation

Siyoon Jin, Seongchan Kim, Dahyun Chung, Jaeho Lee, Hyunwook Choi, Jisu Nam, Jiyoung Kim, and Seungryong Kim. Matrix: Mask track alignment for interaction-aware video generation, 2025. URL https://arxiv.org/abs/2510.07310

work page internal anchor Pith review Pith/arXiv arXiv 2025

[21] [21]

Moditalker: Motion-disentangled diffusion model for high-fidelity talking head generation, 2024

Seyeon Kim, Siyoon Jin, Jihye Park, Kihong Kim, Jiyoung Kim, Jisu Nam, and Seungryong Kim. Moditalker: Motion-disentangled diffusion model for high-fidelity talking head generation, 2024. URL https://arxiv.org/abs/2403.19144

work page arXiv 2024

[22] [22]

HunyuanVideo: A Systematic Framework For Large Video Generative Models

Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. Hunyuanvideo: A systematic framework for large video generative models.arXiv preprint arXiv:2412.03603, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[23] [23]

V-warper: Appearance-consistent video diffusion personalization via value warping, 2025

Hyunkoo Lee, Wooseok Jang, Jini Yang, Taehwan Kim, Sangoh Kim, Sangwon Jung, and Seungryong Kim. V-warper: Appearance-consistent video diffusion personalization via value warping, 2025. URL https://arxiv.org/abs/2512.12375

work page arXiv 2025

[24] [24]

3d scene prompting for scene-consistent camera-controllable video generation, 2025

JoungBin Lee, Jaewoo Jung, Jisang Han, Takuya Narihira, Kazumi Fukuda, Junyoung Seo, Sunghwan Hong, Yuki Mitsufuji, and Seungryong Kim. 3d scene prompting for scene-consistent camera-controllable video generation, 2025. URLhttps://arxiv.org/abs/2510.14945

work page arXiv 2025

[25] [25]

LatentSync: Taming audio-conditioned latent diffusion models for lip sync with SyncNet supervision.arXiv preprint arXiv:2412.09262, 2024

Chunyu Li, Chao Zhang, Weikai Xu, Jingyu Lin, Jinghui Xie, Weiguo Feng, Bingyue Peng, Cunjian Chen, and Weiwei Xing. LatentSync: Taming audio-conditioned latent diffusion models for lip sync with SyncNet supervision.arXiv preprint arXiv:2412.09262, 2024

work page arXiv 2024

[26] [26]

Rolling Forcing: Autoregressive Long Video Diffusion in Real Time

Kunhao Liu, Wenbo Hu, Jiale Xu, Ying Shan, and Shijian Lu. Rolling forcing: Autoregressive long video diffusion in real time, 2025. URLhttps://arxiv.org/abs/2509.25161

work page internal anchor Pith review Pith/arXiv arXiv 2025

[27] [27]

Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow

Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow.arXiv preprint arXiv:2209.03003, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[28] [28]

Decoupled Weight Decay Regularization

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization.arXiv preprint arXiv:1711.05101, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[29] [29]

Reward Forcing: Efficient Streaming Video Generation with Rewarded Distribution Matching Distillation

Yunhong Lu, Yanhong Zeng, Haobo Li, Hao Ouyang, Qiuyu Wang, Ka Leong Cheng, Jiapeng Zhu, Hengyuan Cao, Zhipeng Zhang, Xing Zhu, et al. Reward forcing: Efficient streaming video generation with rewarded distribution matching distillation.arXiv preprint arXiv:2512.04678, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[30] [30]

SayAnything: Audio-driven lip synchronization with conditional video diffusion, 2025

Junxian Ma, Shiwen Wang, Jian Yang, Junyi Hu, Jian Liang, Guosheng Lin, Jingbo Chen, Kai Li, and Yu Meng. SayAnything: Audio-driven lip synchronization with conditional video diffusion, 2025. URL https://arxiv.org/abs/2502.11515. 11

work page arXiv 2025

[31] [31]

Diff2lip: Audio conditioned diffusion models for lip-synchronization, 2023

Soumik Mukhopadhyay, Saksham Suri, Ravi Teja Gadde, and Abhinav Shrivastava. Diff2lip: Audio conditioned diffusion models for lip-synchronization, 2023. URL https://arxiv.org/abs/2308. 09716

2023

[32] [32]

Omnisync: Towards universal lip synchronization via diffusion transformers, 2025

Ziqiao Peng, Jiwen Liu, Haoxian Zhang, Xiaoqiang Liu, Songlin Tang, Pengfei Wan, Di Zhang, Hongyan Liu, and Jun He. Omnisync: Towards universal lip synchronization via diffusion transformers, 2025. URL https://arxiv.org/abs/2505.21448

work page arXiv 2025

[33] [33]

In: Proceedings of the 28th ACM International Conference on Multimedia

K R Prajwal, Rudrabha Mukhopadhyay, Vinay P. Namboodiri, and C.V . Jawahar. A lip sync expert is all you need for speech to lip generation in the wild. InProceedings of the 28th ACM International Conference on Multimedia, MM ’20, page 484–492. ACM, October 2020. doi: 10.1145/3394171.3413532. URL http://dx.doi.org/10.1145/3394171.3413532

work page doi:10.1145/3394171.3413532 2020

[34] [34]

High-resolution image synthesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022

2022

[35] [35]

Sand.ai, Hansi Teng, Hongyu Jia, Lei Sun, Lingzhi Li, Maolin Li, Mingqiu Tang, Shuai Han, Tianning Zhang, W. Q. Zhang, Weifeng Luo, Xiaoyang Kang, Yuchen Sun, Yue Cao, Yunpeng Huang, Yutong Lin, Yuxin Fang, Zewei Tao, Zheng Zhang, Zhongshu Wang, Zixun Liu, Dai Shi, Guoli Su, Hanwen Sun, Hong Pan, Jie Wang, Jiexin Sheng, Min Cui, Min Hu, Ming Yan, Shucheng...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[36] [36]

Motionstream: Real-time video generation with interactive motion controls.arXiv preprint arXiv:2511.01266, 2025

Joonghyuk Shin, Zhengqi Li, Richard Zhang, Jun-Yan Zhu, Jaesik Park, Eli Shechtman, and Xun Huang. Motionstream: Real-time video generation with interactive motion controls, 2025. URL https://arxiv. org/abs/2511.01266

work page arXiv 2025

[37] [37]

Denoising Diffusion Implicit Models

Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models, 2022. URL https://arxiv.org/abs/2010.02502

work page internal anchor Pith review Pith/arXiv arXiv 2022

[38] [38]

Blindly assess image quality in the wild guided by a self-adaptive hyper network

Shaolin Su, Qingsen Yan, Yu Zhu, Cheng Zhang, Xin Ge, Jinqiu Sun, and Yanning Zhang. Blindly assess image quality in the wild guided by a self-adaptive hyper network. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3667–3676, 2020

2020

[39] [39]

Towards Accurate Generative Models of Video: A New Metric & Challenges

Thomas Unterthiner, Sjoerd van Steenkiste, Karol Kurach, Raphael Marinier, Marcin Michalski, and Sylvain Gelly. Towards accurate generative models of video: A new metric & challenges, 2019. URL https://arxiv.org/abs/1812.01717

work page internal anchor Pith review Pith/arXiv arXiv 2019

[40] [40]

Wan: Open and Advanced Large-Scale Video Generative Models

Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, Jianyuan Zeng, Jiayu Wang, Jingfeng Zhang, Jingren Zhou, Jinkai Wang, Jixuan Chen, Kai Zhu, Kang Zhao, Keyu Yan, Lianghua Huang, Mengyang Feng, Ningyi Zhang, Pandeng Li, Pingyu Wu, Ruihang Chu, Ruili Feng, Shiwei Zhang, Siyang Sun, Tao Fang, T...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[41] [41]

Tan, and Haizhou Li

Jiadong Wang, Xinyuan Qian, Malu Zhang, Robby T. Tan, and Haizhou Li. Seeing what you said: Talking face generation guided by a lip reading expert, 2023. URLhttps://arxiv.org/abs/2303.17480

work page arXiv 2023

[42] [42]

Fantasytalking: Realistic talking portrait generation via coherent motion synthesis, 2025

Mengchao Wang, Qiang Wang, Fan Jiang, Yaqi Fan, Yunpeng Zhang, Yonggang Qi, Kun Zhao, and Mu Xu. Fantasytalking: Realistic talking portrait generation via coherent motion synthesis, 2025. URL https://arxiv.org/abs/2504.04842

work page arXiv 2025

[43] [43]

Image quality assessment: from error visibility to structural similarity.IEEE transactions on image processing, 13(4):600–612, 2004

Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment: from error visibility to structural similarity.IEEE transactions on image processing, 13(4):600–612, 2004

2004

[44] [44]

LongLive: Real-time Interactive Long Video Generation

Shuai Yang, Wei Huang, Ruihang Chu, Yicheng Xiao, Yuyang Zhao, Xianbang Wang, Muyang Li, Enze Xie, Yingcong Chen, Yao Lu, Song Han, and Yukang Chen. LongLive: Real-time interactive long video generation, 2025. URLhttps://arxiv.org/abs/2509.22622

work page internal anchor Pith review Pith/arXiv arXiv 2025

[45] [45]

CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. CogVideoX: Text-to-video diffusion models with an expert transformer.arXiv preprint arXiv:2408.06072, 2024. 12

work page internal anchor Pith review Pith/arXiv arXiv 2024

[46] [46]

Deep forc- ing: Training-free long video generation with deep sink and participative compression.arXiv preprint arXiv:2512.05081, 2025

Jung Yi, Wooseok Jang, Paul Hyunbin Cho, Jisu Nam, Heeji Yoon, and Seungryong Kim. Deep forc- ing: Training-free long video generation with deep sink and participative compression.arXiv preprint arXiv:2512.05081, 2025

work page arXiv 2025

[47] [47]

Improved distribution matching distillation for fast image synthesis.Advances in neural information processing systems, 37:47455–47487, 2024

Tianwei Yin, Michaël Gharbi, Taesung Park, Richard Zhang, Eli Shechtman, Fredo Durand, and William T Freeman. Improved distribution matching distillation for fast image synthesis.Advances in neural information processing systems, 37:47455–47487, 2024

2024

[48] [48]

One-step diffusion with distribution matching distillation

Tianwei Yin, Michaël Gharbi, Richard Zhang, Eli Shechtman, Fredo Durand, William T Freeman, and Taesung Park. One-step diffusion with distribution matching distillation. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6613–6623, 2024

2024

[49] [49]

From slow bidirectional to fast autoregressive video diffusion models

Tianwei Yin, Qiang Zhang, Richard Zhang, William T Freeman, Fredo Durand, Eli Shechtman, and Xun Huang. From slow bidirectional to fast autoregressive video diffusion models. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 22963–22974, 2025

2025

[50] [50]

The unreasonable effectiveness of deep features as a perceptual metric

Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 586–595, 2018

2018

[51] [51]

MuseTalk: Real-time high-fidelity video dubbing via spatio-temporal sampling, 2025

Yue Zhang, Zhizhou Zhong, Minhao Liu, Zhaokang Chen, Bin Wu, Yubin Zeng, Chao Zhan, Yingjie He, Junxin Huang, and Wenjiang Zhou. MuseTalk: Real-time high-fidelity video dubbing via spatio-temporal sampling, 2025. URLhttps://arxiv.org/abs/2410.10122

work page arXiv 2025

[52] [52]

Flow-guided one-shot talking face generation with a high-resolution audio-visual dataset

Zhimeng Zhang, Lincheng Li, Yu Ding, and Changjie Fan. Flow-guided one-shot talking face generation with a high-resolution audio-visual dataset. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3661–3670, 2021. 13 Appendix A Index of supplementary material This appendix is organized into seven sections. Section B ...

2021