Test-Time Self-Adaptive Conditioning for Stable Audio-Driven Talking-Head Generation

Lei Wang; Yongsheng Gao; Yu Zhang; Zhicheng Zhang

arxiv: 2605.25488 · v1 · pith:43TEZSGCnew · submitted 2026-05-25 · 💻 cs.CV · cs.AI · cs.MM

Zhicheng Zhang, Lei Wang, Yu Zhang, Yongsheng Gao

Pith reviewed 2026-06-29 22:34 UTC · model grok-4.3

classification 💻 cs.CV cs.AI cs.MM

keywords: talking-head generation, test-time adaptation, audio-driven animation, identity preservation, temporal consistency, generative stability, feedback conditioning

The pith

A single test-time adaptation step using the generator's own outputs refines the conditioning reference and stabilizes identity and motion in audio-driven talking-head videos.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Existing audio-driven talking-head models condition an entire video on one fixed reference portrait. This creates a mismatch with changing facial motion that produces identity drift and temporal inconsistency. TT-SAC composes the generator with its encoder in a feedback loop so that the generator's outputs are re-encoded to produce an updated conditioning signal. One adaptation step approximates the self-consistent equilibrium of this loop. Theoretical analysis shows the step reduces feature variance under mild Lipschitz conditions and exposes a bias-variance tradeoff. Experiments on multiple pretrained generators and benchmarks report gains in lip-sync accuracy, temporal coherence, identity preservation, and perceptual quality without any retraining or extra supervision.

Core claim

By feeding the generator's outputs back through its encoder, a single adaptation step constructs a refined conditioning representation that approximates the fixed point of the generator-encoder composition. This fixed point aligns more closely with the temporal dynamics of the synthesized sequence, reducing variance in identity and motion features while improving generative stability under mild Lipschitz assumptions on the composition.

What carries the argument

The feedback loop that composes the pretrained generator with its encoder to derive a sequence-aligned conditioning signal from the generator's own outputs in one step.
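The loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: the abstract does not give the update rule, so the frame averaging, the convex blend, and the names `tt_sac_step`, `alpha`, `num_frames`, `toy_encode`, and `toy_generate` are all hypothetical. The toy linear maps only stand in for a pretrained encoder and generator.

```python
import numpy as np


def tt_sac_step(encode, generate, reference, audio, alpha=0.5, num_frames=None):
    """One feedback pass: generate, re-encode the outputs, blend, regenerate.

    `alpha` (blend strength) and `num_frames` (frames aggregated) are
    illustrative knobs, not values or names taken from the paper.
    """
    c0 = encode(reference)                    # static conditioning from the portrait
    first_pass = generate(c0, audio)          # frames produced under static conditioning
    used = first_pass if num_frames is None else first_pass[:num_frames]
    c_fb = np.mean([encode(f) for f in used], axis=0)  # re-encoded outputs, averaged
    c1 = (1.0 - alpha) * c0 + alpha * c_fb    # assumed convex blend of old and new
    return generate(c1, audio), c1            # second pass uses refined conditioning


# Toy stand-ins for a pretrained encoder/generator pair (purely hypothetical).
_rng = np.random.default_rng(0)
_W_enc = _rng.normal(size=(4, 8)) / np.sqrt(8)  # 8-dim "image" -> 4-dim embedding
_W_dec = _rng.normal(size=(8, 4)) / np.sqrt(4)  # 4-dim embedding -> 8-dim "image"


def toy_encode(image):
    return _W_enc @ image


def toy_generate(cond, audio):
    # One frame per audio feature vector: decoded identity plus audio-driven motion.
    return [_W_dec @ cond + 0.1 * a for a in audio]


if __name__ == "__main__":
    reference = _rng.normal(size=8)
    audio = _rng.normal(size=(16, 8))
    frames, c1 = tt_sac_step(toy_encode, toy_generate, reference, audio, alpha=0.5)
    print(len(frames), c1.shape)
```

No gradients or weight updates appear anywhere in the sketch, which matches the training-free description. Setting `alpha=0` recovers the static-reference baseline, which makes the baseline comparison a one-argument change.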

If this is right

Identity preservation and temporal coherence improve on existing benchmark datasets for multiple pretrained generators.
Lip-sync accuracy and perceptual fidelity increase without retraining, gradient updates, or additional data.
The method applies model-agnostically to any pretrained talking-head generator.
A principled bias-variance tradeoff controls the strength of the adaptation.
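
The last point can be illustrated with a small worked example. It is a generic shrinkage calculation under assumptions chosen here for illustration, not the paper's analysis: the symbols $c^\star$, $b$, $\sigma^2$, $K$, and $\alpha$ are hypothetical, the re-encoded frame embeddings are assumed unbiased with i.i.d. noise, and the update is assumed to be a convex blend.

```latex
% Assumed setup (not from the paper):
%   c^\star : ideal conditioning for the sequence
%   c_0 = c^\star + b : static reference embedding, deterministic offset b
%   e_k = c^\star + \varepsilon_k,\ k = 1,\dots,K : re-encoded frame embeddings,
%       \varepsilon_k i.i.d., \mathbb{E}[\varepsilon_k] = 0,
%       \mathbb{E}\|\varepsilon_k\|^2 = \sigma^2
\begin{align}
  \bar{c} &= \frac{1}{K}\sum_{k=1}^{K} e_k,
  \qquad
  c_\alpha = (1-\alpha)\,c_0 + \alpha\,\bar{c},
  \quad \alpha \in [0,1], \\
  \mathbb{E}\,\|c_\alpha - c^\star\|^2
  &= \underbrace{(1-\alpha)^2\,\|b\|^2}_{\text{bias}^2}
   + \underbrace{\alpha^2\,\frac{\sigma^2}{K}}_{\text{variance}}, \\
  \alpha^\star
  &= \frac{\|b\|^2}{\|b\|^2 + \sigma^2/K}.
\end{align}
```

Under these assumptions the best strength moves toward 1 when the static reference is badly mismatched or many frames are aggregated, and toward 0 when the re-encoded features are noisy.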

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same feedback construction could be tested on other conditional video tasks where a static reference mismatches evolving outputs.
Extending the one-step update to a few iterations might further reduce residual drift if inference budget permits.
Evaluating the approach on sequences much longer than current benchmarks would test whether the approximated equilibrium persists over extended time.

Load-bearing premise

The generator-encoder composition forms a feedback loop whose fixed point can be approximated in one step, and this fixed point improves alignment with the true temporal dynamics of the target sequence.

What would settle it

If applying the single adaptation step increases measured identity drift or temporal inconsistency on standard benchmark videos relative to the static-reference baseline, the stability claim is falsified.

Figures

Figures reproduced from arXiv: 2605.25488 by Lei Wang, Yongsheng Gao, Yu Zhang, Zhicheng Zhang.

**Figure 2.** Test-Time Self-Adaptive Conditioning (TT-SAC). Conventional talking-head generators use a single identity embedding

**Figure 3.** Qualitative comparison of typical failure cases. Each block compares the real video against the baseline, baseline+TT

**Figure 4.** Effect of the number of aggregated frames

**Figure 5.** Generalization to other audio-driven video synthesis tasks in OmniAvatar [49]. Our method improves motion

Original abstract

Audio-driven talking-head generation has achieved remarkable progress with recent models such as AniTalker, FLOAT, and Sonic. Despite their success, most existing approaches rely on a single static reference image to condition the entire video generation process at inference stage. This static conditioning paradigm often creates a mismatch between fixed identity features and dynamically evolving facial motion, leading to identity drift, temporal inconsistency, and degraded perceptual quality. We introduce Test-Time Self-Adaptive Conditioning (TT-SAC), a parameter-free inference framework that enables pretrained talking-head generators to adapt their conditioning representations during inference without retraining, gradient updates, or additional supervision. Instead of treating the reference portrait as immutable, TT-SAC composes the generator with its encoder in a feedback loop: the generator's own outputs are re-encoded to construct a refined conditioning representation that better aligns with the temporal dynamics of the synthesized sequence. A single adaptation step approximates a self-consistent equilibrium of the generative process, stabilizing identity and motion across time. We further provide theoretical analysis showing that test-time conditioning adaptation reduces feature variance and improves generative stability under mild Lipschitz assumptions, while exhibiting a principled bias-variance tradeoff that governs the optimal strength of adaptation. Extensive experiments on state-of-the-art talking-head generators and benchmark datasets demonstrate consistent improvements in lip-sync accuracy, temporal coherence, identity preservation, and perceptual fidelity. TT-SAC offers a model-agnostic and training-free strategy for enhancing generative video models, establishing test-time conditioning adaptation as an effective mechanism for stabilizing audio-driven portrait animation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

TT-SAC adds a feedback-loop tweak for test-time conditioning in talking heads that looks workable but rests on thin theory and missing experimental details.

TT-SAC is a training-free method that feeds the generator's own outputs back through its encoder to update the reference conditioning during inference. The core move is treating the static reference as adjustable via this one-step loop to reduce identity drift and motion inconsistency.

The construction itself is new within the talking-head literature. Prior work on inference-time adaptation exists, but the specific generator-encoder composition for audio-driven portrait animation is not in the cited models. The paper also states that the method is model-agnostic and reports gains on lip-sync, coherence, and perceptual scores when plugged into existing generators.

The soft spots are in the supporting claims. The abstract mentions a theoretical argument under mild Lipschitz assumptions and a bias-variance tradeoff, yet supplies no equations, no contraction bound, and no iteration analysis to show why one step suffices for equilibrium. Without those, the stabilization could come from simple feature averaging rather than the claimed fixed-point property. Experiments are described only at a high level with no error bars, dataset sizes, or confirmation that the adaptation strength was not tuned after seeing results.

This paper is for researchers who work on audio-driven video generation and want quick inference fixes. A reader already running models like AniTalker or FLOAT could test the loop in an afternoon and see if it helps their outputs.

The work is coherent enough on its own terms to merit referee time. The idea is straightforward to implement and the empirical pattern is worth checking even if the theory needs tightening.

Referee Report

3 major / 1 minor

Summary. The paper proposes Test-Time Self-Adaptive Conditioning (TT-SAC), a parameter-free, training-free inference framework for audio-driven talking-head generation. It composes a pretrained generator with its encoder in a feedback loop so that the generator's outputs are re-encoded to produce a refined conditioning representation; a single adaptation step is claimed to approximate a self-consistent equilibrium that stabilizes identity and motion. The manuscript supplies a theoretical argument that this reduces feature variance under mild Lipschitz assumptions while exhibiting a bias-variance tradeoff, and reports empirical gains in lip-sync accuracy, temporal coherence, identity preservation, and perceptual quality across state-of-the-art generators and benchmark datasets.

Significance. If the one-step fixed-point claim can be supported by an explicit contraction analysis and the empirical results prove reproducible with proper controls, TT-SAC would constitute a general, model-agnostic technique for improving temporal stability in conditional video generators without retraining or additional data.

major comments (3)

[Abstract] Abstract (theoretical analysis paragraph): the claim that 'a single adaptation step approximates a self-consistent equilibrium' under 'mild Lipschitz assumptions' is not accompanied by any contraction-mapping bound, residual analysis, or condition on the Lipschitz constant L that would guarantee sufficient accuracy after one iteration; without such a bound the observed stabilization may reduce to simple feature averaging rather than equilibrium properties.
[Abstract] Abstract: the method is repeatedly described as 'parameter-free,' yet the same paragraph states that adaptation strength is governed by a 'principled bias-variance tradeoff' that determines its 'optimal strength'; this internal tension must be resolved by showing either that the strength is derived without any free parameter or that it is fixed by a universal rule independent of the test sequence.
[Abstract] Abstract (experiments paragraph): no error bars, statistical tests, dataset cardinalities, or verification that adaptation strength was not selected post-hoc on the evaluation sets are supplied, so the reported 'consistent improvements' cannot be assessed for reliability or generality.
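
For reference, the kind of bound the first major comment asks for is the textbook one from the Banach fixed-point theorem. The statement below is that standard result under an assumption the abstract does not confirm, namely that the generator-encoder composition $T$ (for fixed audio) is a contraction with constant $L < 1$. It is not the paper's own analysis, and the symbols $T$, $c_0$, $c_1$, and $c^\star$ are introduced here for illustration.

```latex
% Assumption: T is a contraction on a complete normed space,
%   \|T(x) - T(y)\| \le L\,\|x - y\| with 0 \le L < 1,
% so T has a unique fixed point c^\star = T(c^\star).
% One adaptation step from the static conditioning c_0 gives c_1 = T(c_0).
\begin{align}
  \|c_1 - c^\star\|
  &= \|T(c_0) - T(c^\star)\|
   \le L\,\|c_0 - c^\star\|
  && \text{(a priori)} \\
  \|c_1 - c^\star\|
  &\le L\bigl(\|c_0 - c_1\| + \|c_1 - c^\star\|\bigr)
  \;\Longrightarrow\;
  \|c_1 - c^\star\| \le \frac{L}{1-L}\,\|c_1 - c_0\|
  && \text{(a posteriori)}
\end{align}
```

The a posteriori form is the useful one in practice because $\|c_1 - c_0\|$ is measurable at inference time. One step is a good approximation only when $L$ is well below 1; a Lipschitz constant at or above 1 gives no guarantee at all, which is why the referee asks for an explicit condition on $L$.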

minor comments (1)

[Abstract] Abstract: the models AniTalker, FLOAT, and Sonic are named but the manuscript should confirm that full bibliographic references appear in the reference list.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment point by point below, offering clarifications and committing to revisions that strengthen the presentation without altering the core claims.

Referee: [Abstract] Abstract (theoretical analysis paragraph): the claim that 'a single adaptation step approximates a self-consistent equilibrium' under 'mild Lipschitz assumptions' is not accompanied by any contraction-mapping bound, residual analysis, or condition on the Lipschitz constant L that would guarantee sufficient accuracy after one iteration; without such a bound the observed stabilization may reduce to simple feature averaging rather than equilibrium properties.

Authors: We agree that an explicit contraction-mapping bound and residual analysis would strengthen the one-step approximation claim. Our current theoretical argument establishes variance reduction under Lipschitz continuity but does not supply a quantitative bound on the residual after a single iteration. In the revised manuscript we will add this analysis, deriving a bound on the approximation error conditioned on L. revision: yes
Referee: [Abstract] Abstract: the method is repeatedly described as 'parameter-free,' yet the same paragraph states that adaptation strength is governed by a 'principled bias-variance tradeoff' that determines its 'optimal strength'; this internal tension must be resolved by showing either that the strength is derived without any free parameter or that it is fixed by a universal rule independent of the test sequence.

Authors: The designation 'parameter-free' denotes the absence of any learned parameters, retraining, or test-time optimization. The bias-variance tradeoff is used only to derive a single fixed strength value that is applied uniformly to every test sequence; this value is independent of individual data and chosen once from the theoretical analysis. We will revise the abstract to state this explicitly and remove any implication of per-sequence optimization. revision: yes
Referee: [Abstract] Abstract (experiments paragraph): no error bars, statistical tests, dataset cardinalities, or verification that adaptation strength was not selected post-hoc on the evaluation sets are supplied, so the reported 'consistent improvements' cannot be assessed for reliability or generality.

Authors: We concur that these reporting elements are necessary. The revised version will add error bars from multiple random seeds, paired statistical tests, explicit dataset cardinalities, and a statement confirming that the adaptation strength was fixed a priori from the theoretical analysis and never tuned on the evaluation sets. revision: yes
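
The reporting promised in the last response could take the following shape. This is a sketch using a paired bootstrap over per-video metric differences, which is one reasonable choice among several and is not stated in the source. The function name `paired_bootstrap_ci` and the scores in the usage example are hypothetical placeholders, not results from the paper.

```python
import numpy as np


def paired_bootstrap_ci(baseline, adapted, n_boot=2000, level=0.95, seed=0):
    """Mean paired difference (adapted - baseline) with a bootstrap interval.

    Each index must refer to the same test video under both conditions.
    """
    baseline = np.asarray(baseline, dtype=float)
    adapted = np.asarray(adapted, dtype=float)
    if baseline.shape != adapted.shape or baseline.ndim != 1:
        raise ValueError("expected two 1-D arrays of equal length")
    diff = adapted - baseline
    rng = np.random.default_rng(seed)
    # Resample videos with replacement, keeping each pair together.
    idx = rng.integers(0, diff.size, size=(n_boot, diff.size))
    boot_means = diff[idx].mean(axis=1)
    lo, hi = np.quantile(boot_means, [(1 - level) / 2, (1 + level) / 2])
    return float(diff.mean()), float(lo), float(hi)


if __name__ == "__main__":
    # Synthetic per-video scores for illustration only.
    base = [0.70, 0.72, 0.68, 0.71, 0.69]
    adapted = [0.74, 0.73, 0.72, 0.75, 0.70]
    print(paired_bootstrap_ci(base, adapted))
```

An interval that excludes zero supports a consistent improvement on that metric. The adaptation strength must be fixed before this evaluation is run for the interval to mean anything.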

Circularity Check

0 steps flagged

No circularity; adaptation and equilibrium approximation defined independently of claimed stability gains

The paper explicitly defines TT-SAC as a parameter-free composition of generator and encoder into a one-step feedback loop that approximates equilibrium, then separately claims (under mild Lipschitz assumptions) that this reduces feature variance and exhibits a bias-variance tradeoff. No equations, self-citations, or fitted parameters are shown that would make the reported stabilization or alignment with temporal dynamics equivalent to the input definitions by construction. The derivation chain is self-contained; the one-step approximation is presented as an empirical mechanism rather than a tautological renaming or load-bearing self-reference.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

Review performed on abstract only; free parameters and axioms are inferred from stated claims rather than explicit derivations.

free parameters (1)

adaptation strength
Governs the bias-variance tradeoff in the theoretical analysis; value not supplied in abstract.

axioms (1)

domain assumption: mild Lipschitz assumptions on the generator-encoder composition
Invoked to guarantee variance reduction and existence of equilibrium after one adaptation step.

Reference graph

Works this paper leans on

49 extracted references · 17 canonical work pages · 6 internal anchors

[1] A. Siarohin, S. Lathuilière, S. Tulyakov, E. Ricci, and N. Sebe, “First order motion model for image animation,” in Advances in Neural Information Processing Systems, 2019, pp. 7135–7145.
[2] L. Chen, R. K. Maddox, Z. Duan, and C. Xu, “Hierarchical cross-modal talking face generation with dynamic pixel-wise loss,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 7832–7841.
[3] K. Prajwal, R. Mukhopadhyay, V. P. Namboodiri, and C. Jawahar, “A lip sync expert is all you need for speech to lip generation in the wild,” in Proceedings of the 28th ACM International Conference on Multimedia, 2020, pp. 484–492.
[4] M. Xu, H. Li, Q. Su, H. Shang, L. Zhang, C. Liu, J. Wang, Y. Yao, and S. Zhu, “Hallo: Hierarchical audio-driven visual synthesis for portrait image animation,” arXiv preprint arXiv:2406.08801, 2024.
[5] J. Cui, H. Li, Y. Yao, H. Zhu, H. Shang, K. Cheng, H. Zhou, S. Zhu, and J. Wang, “Hallo2: Long-duration and high-resolution audio-driven portrait image animation,” arXiv preprint arXiv:2410.07718, 2024.
[6] J. Cui, H. Li, Y. Zhan, H. Shang, K. Cheng, Y. Ma, S. Mu, H. Zhou, J. Wang, and S. Zhu, “Hallo3: Highly dynamic and realistic portrait image animation with video diffusion transformer,” in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 21086–21095.
[7] J. Cui, Y. Chen, M. Xu, H. Shang, Y. Chen, Y. Zhan, Z. Dong, Y. Yao, J. Wang, and S. Zhu, “Hallo4: High-fidelity dynamic portrait animation via direct preference optimization and temporal motion modulation,” arXiv preprint arXiv:2505.23525, 2025.
[8] H. Cheng, L. Lin, C. Liu, P. Xia, P. Hu, J. Ma, J. Du, and J. Pan, “Dawn: Dynamic frame avatar with non-autoregressive diffusion framework for talking head video generation,” arXiv preprint arXiv:2410.13726, 2024.
[9] H. Wei, Z. Yang, and Z. Wang, “Aniportrait: Audio-driven synthesis of photorealistic portrait animation,” arXiv preprint arXiv:2403.17694, 2024.
[10] J. Guo, D. Zhang, X. Liu, Z. Zhong, Y. Zhang, P. Wan, and D. Zhang, “Liveportrait: Efficient portrait animation with stitching and retargeting control,” arXiv preprint arXiv:2407.03168, 2024.
[11] S. Xu, G. Chen, Y.-X. Guo, J. Yang, C. Li, Z. Zang, Y. Zhang, X. Tong, and B. Guo, “Vasa-1: Lifelike audio-driven talking faces generated in real time,” in Advances in Neural Information Processing Systems, 2024, pp. 660–684.
[12] Z. Zhang, L. Wang, Y. Gao, and Y. Zhang, “Talking-head generation in practice,” in The Second International Workshop on Transformative Insights in Multifaceted Evaluation at The Web Conference 2026, 2026. [Online]. Available: https://openreview.net/forum?id=ns3TgZYQTZ
[13] W. Zhang, X. Cun, X. Wang, Y. Zhang, X. Shen, Y. Guo, Y. Shan, and F. Wang, “Sadtalker: Learning realistic 3d motion coefficients for stylized audio-driven single image talking face animation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 8652–8661.
[14] T. Liu, F. Chen, S. Fan, C. Du, Q. Chen, X. Chen, and K. Yu, “Anitalker: animate vivid and diverse talking faces through identity-decoupled facial motion encoding,” in Proceedings of the 32nd ACM International Conference on Multimedia, 2024, pp. 6696–6705.
[15] T. Ki, D. Min, and G. Chae, “Float: Generative motion latent flow matching for audio-driven talking portrait,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 14699–14710.
[16] X. Ji, X. Hu, Z. Xu, J. Zhu, C. Lin, Q. He, J. Zhang, D. Luo, Y. Chen, Q. Lin et al., “Sonic: Shifting focus to global audio perception in portrait animation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025, pp. 193–203.
[17] R. Kumar, J. Sotelo, K. Kumar, A. De Brebisson, and Y. Bengio, “Obamanet: Photo-realistic lip-sync from text,” arXiv preprint arXiv:1801.01442, 2017.
[18] H. Kim, P. Garrido, A. Tewari, W. Xu, J. Thies, M. Niessner, P. Pérez, C. Richardt, M. Zollhöfer, and C. Theobalt, “Deep video portraits,” ACM transactions on graphics (TOG), vol. 37, no. 4, pp. 1–14, 2018.
[19] S. Suwajanakorn, S. M. Seitz, and I. Kemelmacher-Shlizerman, “Synthesizing obama: learning lip sync from audio,” ACM Transactions on Graphics (ToG), vol. 36, no. 4, pp. 1–13, 2017.
[20] T. Karras, T. Aila, S. Laine, A. Herva, and J. Lehtinen, “Audio-driven facial animation by joint end-to-end learning of pose and emotion,” ACM Transactions on Graphics (ToG), vol. 36, no. 4, pp. 1–12, 2017.
[21] J. S. Chung and A. Zisserman, “Out of time: automated lip sync in the wild,” in Asian Conference on Computer Vision, 2016, pp. 251–263.
[22] S. A. Jalalifar, H. Hasani, and H. Aghajan, “Speech-driven facial reenactment using conditional generative adversarial networks,” arXiv preprint arXiv:1803.07461, 2018.
[23] R. Yi, Z. Ye, J. Zhang, H. Bao, and Y.-J. Liu, “Audio-driven talking face video generation with learning-based personalized head pose,” arXiv preprint arXiv:2002.10137, 2020.
[24] J. Thies, M. Zollhofer, M. Stamminger, C. Theobalt, and M. Nießner, “Face2face: Real-time face capture and reenactment of rgb videos,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2016, pp. 2387–2395.
[25] T.-C. Wang, A. Mallya, and M.-Y. Liu, “One-shot free-view neural talking-head synthesis for video conferencing,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 10039–10049.
[26] Y. Ma, S. Zhang, J. Wang, X. Wang, Y. Zhang, and Z. Deng, “Dreamtalk: When emotional talking head generation meets diffusion probabilistic models,” arXiv preprint arXiv:2312.09767, 2023.
[27] M. Wang, Q. Wang, F. Jiang, Y. Fan, Y. Zhang, Y. Qi, K. Zhao, and M. Xu, “Fantasytalking: Realistic talking portrait generation via coherent motion synthesis,” in Proceedings of the 33rd ACM International Conference on Multimedia, 2025, pp. 9891–9900.
[28] Z. Kong, F. Gao, Y. Zhang, Z. Kang, X. Wei, X. Cai, G. Chen, and W. Luo, “Let them talk: Audio-driven multi-person conversational video generation,” arXiv preprint arXiv:2505.22647, 2025.
[29] X. Ma, J. Cai, Y. Guan, S. Huang, Q. Zhang, and S. Zhang, “Playmate: Flexible control of portrait animation via 3d-implicit space guided diffusion,” arXiv preprint arXiv:2502.07203, 2025.
[30] Z. Chen, J. Cao, Z. Chen, Y. Li, and C. Ma, “Echomimic: Lifelike audio-driven portrait animations through editable landmark conditions,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 39, no. 3, 2025, pp. 2403–2410.
[31] N. Drobyshev, J. Chelishev, T. Khakhulin, A. Ivakhnenko, V. Lempitsky, and E. Zakharov, “Megaportraits: One-shot megapixel neural head avatars,” in Proceedings of the 30th ACM International Conference on Multimedia, 2022, pp. 2663–2671.
[32] W. Tan, C. Lin, C. Xu, F. Xu, X. Hu, X. Ji, J. Zhu, C. Wang, and Y. Fu, “Disentangle identity, cooperate emotion: Correlation-aware emotional talking portrait generation,” in Proceedings of the 33rd ACM International Conference on Multimedia, 2025, pp. 9987–9995.
[33] D. Ding, L. Wang, L. Zhu, T. Gedeon, and P. Koniusz, “Learnable expansion of graph operators for multi-modal feature fusion,” in The Thirteenth International Conference on Learning Representations, 2025. [Online]. Available: https://openreview.net/forum?id=SMZqIOSdlN
[34] Z. Liu, X. Liu, S. Chen, J. Liu, L. Wang, and C. Bi, “Multimodal fusion for talking face generation utilizing speech-related facial action units,” ACM Transactions on Multimedia Computing, Communications and Applications, vol. 20, no. 9, pp. 1–24, 2024.
[35] J. Guan, Z. Zhang, H. Zhou, T. Hu, K. Wang, D. He, H. Feng, J. Liu, E. Ding, Z. Liu et al., “Stylesync: High-fidelity generalized and personalized lip sync in style-based generator,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 1505–1515.
[36] J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,” in Advances in Neural Information Processing Systems, 2020, pp. 6840–6851.
[37] J. Song, C. Meng, and S. Ermon, “Denoising diffusion implicit models,” arXiv preprint arXiv:2010.02502, 2020.
[38] Y. Lipman, R. T. Chen, H. Ben-Hamu, M. Nickel, and M. Le, “Flow matching for generative modeling,” arXiv preprint arXiv:2210.02747, 2022.
[39] C. Lu, Y. Zhou, F. Bao, J. Chen, C. Li, and J. Zhu, “Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps,” in Advances in Neural Information Processing Systems, 2022, pp. 5775–5787.
[40] N. Metropolis and S. Ulam, “The monte carlo method,” Journal of the American statistical association, vol. 44, no. 247, pp. 335–341, 1949.
[41] S. Banach, “Sur les opérations dans les ensembles abstraits et leur application aux équations intégrales,” Fundamenta mathematicae, vol. 3, no. 1, pp. 133–181, 1922.
[42] S. R. Livingstone and F. A. Russo, “The ryerson audio-visual database of emotional speech and song (ravdess): A dynamic, multimodal set of facial and vocal expressions in north american english,” PloS one, vol. 13, no. 5, p. e0196391, 2018.
[43] H. Zhu, W. Wu, W. Zhu, L. Jiang, S. Tang, L. Zhang, Z. Liu, and C. C. Loy, “Celebv-hq: A large-scale video facial attributes dataset,” in European Conference on Computer Vision, 2022, pp. 650–667.
[44] X. Cao, G. Wang, S. Shi, J. Zhao, Y. Yao, J. Fei, and M. Gao, “Joyvasa: portrait and animal image animation with diffusion-based audio-driven facial dynamics and head motion generation,” arXiv preprint arXiv:2411.09209, 2024.
[45] S. Jayasumana, S. Ramalingam, A. Veit, D. Glasner, A. Chakrabarti, and S. Kumar, “Rethinking fid: Towards a better evaluation metric for image generation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 9307–9315.
[46] T. Unterthiner, S. Van Steenkiste, K. Kurach, R. Marinier, M. Michalski, and S. Gelly, “Towards accurate generative models of video: A new metric & challenges,” arXiv preprint arXiv:1812.01717, 2018.
[47] J. Deng, J. Guo, N. Xue, and S. Zafeiriou, “Arcface: Additive angular margin loss for deep face recognition,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 4690–4699.
[48] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang, “The unreasonable effectiveness of deep features as a perceptual metric,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018, pp. 586–595.
[49] Q. Gan, R. Yang, J. Zhu, S. Xue, and S. Hoi, “Omniavatar: Efficient audio-driven avatar video generation with adaptive body animation,” arXiv preprint arXiv:2506.18866, 2025.

Zhicheng Zhang is a Ph.D. student at the University of New South Wales (UNSW), Australia, supervised by Dr. Yu Zhang (2024-present). He received his M.S. from The University of Queens...

[1] [1]

First order motion model for image animation,

A. Siarohin, S. Lathuili `ere, S. Tulyakov, E. Ricci, and N. Sebe, “First order motion model for image animation,” inAdvances in Neural Information Processing Systems, 2019, pp. 7135–7145

2019

[2] [2]

Hierarchical cross-modal talking face generation with dynamic pixel-wise loss,

L. Chen, R. K. Maddox, Z. Duan, and C. Xu, “Hierarchical cross-modal talking face generation with dynamic pixel-wise loss,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 7832–7841

2019

[3] [3]

A lip sync expert is all you need for speech to lip generation in the wild,

K. Prajwal, R. Mukhopadhyay, V . P. Namboodiri, and C. Jawahar, “A lip sync expert is all you need for speech to lip generation in the wild,” in Proceedings of the 28th ACM International Conference on Multimedia, 2020, pp. 484–492

2020

[4] M. Xu, H. Li, Q. Su, H. Shang, L. Zhang, C. Liu, J. Wang, Y. Yao, and S. Zhu, “Hallo: Hierarchical audio-driven visual synthesis for portrait image animation,” arXiv preprint arXiv:2406.08801, 2024.

[5] J. Cui, H. Li, Y. Yao, H. Zhu, H. Shang, K. Cheng, H. Zhou, S. Zhu, and J. Wang, “Hallo2: Long-duration and high-resolution audio-driven portrait image animation,” arXiv preprint arXiv:2410.07718, 2024.

[6] J. Cui, H. Li, Y. Zhan, H. Shang, K. Cheng, Y. Ma, S. Mu, H. Zhou, J. Wang, and S. Zhu, “Hallo3: Highly dynamic and realistic portrait image animation with video diffusion transformer,” in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 21 086–21 095.

[7] J. Cui, Y. Chen, M. Xu, H. Shang, Y. Chen, Y. Zhan, Z. Dong, Y. Yao, J. Wang, and S. Zhu, “Hallo4: High-fidelity dynamic portrait animation via direct preference optimization and temporal motion modulation,” arXiv preprint arXiv:2505.23525, 2025.

[8] H. Cheng, L. Lin, C. Liu, P. Xia, P. Hu, J. Ma, J. Du, and J. Pan, “Dawn: Dynamic frame avatar with non-autoregressive diffusion framework for talking head video generation,” arXiv preprint arXiv:2410.13726, 2024.

[9] H. Wei, Z. Yang, and Z. Wang, “Aniportrait: Audio-driven synthesis of photorealistic portrait animation,” arXiv preprint arXiv:2403.17694, 2024.

[10] J. Guo, D. Zhang, X. Liu, Z. Zhong, Y. Zhang, P. Wan, and D. Zhang, “Liveportrait: Efficient portrait animation with stitching and retargeting control,” arXiv preprint arXiv:2407.03168, 2024.

[11] S. Xu, G. Chen, Y.-X. Guo, J. Yang, C. Li, Z. Zang, Y. Zhang, X. Tong, and B. Guo, “Vasa-1: Lifelike audio-driven talking faces generated in real time,” in Advances in Neural Information Processing Systems, 2024, pp. 660–684.

[12] Z. Zhang, L. Wang, Y. Gao, and Y. Zhang, “Talking-head generation in practice,” in The Second International Workshop on Transformative Insights in Multifaceted Evaluation at The Web Conference 2026, 2026. [Online]. Available: https://openreview.net/forum?id=ns3TgZYQTZ

[13] W. Zhang, X. Cun, X. Wang, Y. Zhang, X. Shen, Y. Guo, Y. Shan, and F. Wang, “Sadtalker: Learning realistic 3d motion coefficients for stylized audio-driven single image talking face animation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 8652–8661.

[14] T. Liu, F. Chen, S. Fan, C. Du, Q. Chen, X. Chen, and K. Yu, “Anitalker: animate vivid and diverse talking faces through identity-decoupled facial motion encoding,” in Proceedings of the 32nd ACM International Conference on Multimedia, 2024, pp. 6696–6705.

[15] T. Ki, D. Min, and G. Chae, “Float: Generative motion latent flow matching for audio-driven talking portrait,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 14 699–14 710.

[16] X. Ji, X. Hu, Z. Xu, J. Zhu, C. Lin, Q. He, J. Zhang, D. Luo, Y. Chen, Q. Lin et al., “Sonic: Shifting focus to global audio perception in portrait animation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025, pp. 193–203.

[17] R. Kumar, J. Sotelo, K. Kumar, A. De Brebisson, and Y. Bengio, “Obamanet: Photo-realistic lip-sync from text,” arXiv preprint arXiv:1801.01442, 2017.

[18] H. Kim, P. Garrido, A. Tewari, W. Xu, J. Thies, M. Niessner, P. Pérez, C. Richardt, M. Zollhöfer, and C. Theobalt, “Deep video portraits,” ACM Transactions on Graphics (TOG), vol. 37, no. 4, pp. 1–14, 2018.

[19] S. Suwajanakorn, S. M. Seitz, and I. Kemelmacher-Shlizerman, “Synthesizing obama: learning lip sync from audio,” ACM Transactions on Graphics (TOG), vol. 36, no. 4, pp. 1–13, 2017.

[20] T. Karras, T. Aila, S. Laine, A. Herva, and J. Lehtinen, “Audio-driven facial animation by joint end-to-end learning of pose and emotion,” ACM Transactions on Graphics (TOG), vol. 36, no. 4, pp. 1–12, 2017.

[21] J. S. Chung and A. Zisserman, “Out of time: automated lip sync in the wild,” in Asian Conference on Computer Vision, 2016, pp. 251–263.

[22] S. A. Jalalifar, H. Hasani, and H. Aghajan, “Speech-driven facial reenactment using conditional generative adversarial networks,” arXiv preprint arXiv:1803.07461, 2018.

[23] R. Yi, Z. Ye, J. Zhang, H. Bao, and Y.-J. Liu, “Audio-driven talking face video generation with learning-based personalized head pose,” arXiv preprint arXiv:2002.10137, 2020.

[24] J. Thies, M. Zollhofer, M. Stamminger, C. Theobalt, and M. Nießner, “Face2face: Real-time face capture and reenactment of rgb videos,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2016, pp. 2387–2395.

[25] T.-C. Wang, A. Mallya, and M.-Y. Liu, “One-shot free-view neural talking-head synthesis for video conferencing,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 10 039–10 049.

[26] Y. Ma, S. Zhang, J. Wang, X. Wang, Y. Zhang, and Z. Deng, “Dreamtalk: When emotional talking head generation meets diffusion probabilistic models,” arXiv preprint arXiv:2312.09767, 2023.

[27] M. Wang, Q. Wang, F. Jiang, Y. Fan, Y. Zhang, Y. Qi, K. Zhao, and M. Xu, “Fantasytalking: Realistic talking portrait generation via coherent motion synthesis,” in Proceedings of the 33rd ACM International Conference on Multimedia, 2025, pp. 9891–9900.

[28] Z. Kong, F. Gao, Y. Zhang, Z. Kang, X. Wei, X. Cai, G. Chen, and W. Luo, “Let them talk: Audio-driven multi-person conversational video generation,” arXiv preprint arXiv:2505.22647, 2025.

[29] X. Ma, J. Cai, Y. Guan, S. Huang, Q. Zhang, and S. Zhang, “Playmate: Flexible control of portrait animation via 3d-implicit space guided diffusion,” arXiv preprint arXiv:2502.07203, 2025.

[30] Z. Chen, J. Cao, Z. Chen, Y. Li, and C. Ma, “Echomimic: Lifelike audio-driven portrait animations through editable landmark conditions,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 39, no. 3, 2025, pp. 2403–2410.

[31] N. Drobyshev, J. Chelishev, T. Khakhulin, A. Ivakhnenko, V. Lempitsky, and E. Zakharov, “Megaportraits: One-shot megapixel neural head avatars,” in Proceedings of the 30th ACM International Conference on Multimedia, 2022, pp. 2663–2671.

[32] W. Tan, C. Lin, C. Xu, F. Xu, X. Hu, X. Ji, J. Zhu, C. Wang, and Y. Fu, “Disentangle identity, cooperate emotion: Correlation-aware emotional talking portrait generation,” in Proceedings of the 33rd ACM International Conference on Multimedia, 2025, pp. 9987–9995.

[33] D. Ding, L. Wang, L. Zhu, T. Gedeon, and P. Koniusz, “Learnable expansion of graph operators for multi-modal feature fusion,” in The Thirteenth International Conference on Learning Representations, 2025. [Online]. Available: https://openreview.net/forum?id=SMZqIOSdlN

[34] Z. Liu, X. Liu, S. Chen, J. Liu, L. Wang, and C. Bi, “Multimodal fusion for talking face generation utilizing speech-related facial action units,” ACM Transactions on Multimedia Computing, Communications and Applications, vol. 20, no. 9, pp. 1–24, 2024.

[35] J. Guan, Z. Zhang, H. Zhou, T. Hu, K. Wang, D. He, H. Feng, J. Liu, E. Ding, Z. Liu et al., “Stylesync: High-fidelity generalized and personalized lip sync in style-based generator,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 1505–1515.

[36] J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,” in Advances in Neural Information Processing Systems, 2020, pp. 6840–6851.

[37] J. Song, C. Meng, and S. Ermon, “Denoising diffusion implicit models,” arXiv preprint arXiv:2010.02502, 2020.

[38] Y. Lipman, R. T. Chen, H. Ben-Hamu, M. Nickel, and M. Le, “Flow matching for generative modeling,” arXiv preprint arXiv:2210.02747, 2022.

[39] C. Lu, Y. Zhou, F. Bao, J. Chen, C. Li, and J. Zhu, “Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps,” in Advances in Neural Information Processing Systems, 2022, pp. 5775–5787.

[40] N. Metropolis and S. Ulam, “The monte carlo method,” Journal of the American Statistical Association, vol. 44, no. 247, pp. 335–341, 1949.

[41] S. Banach, “Sur les opérations dans les ensembles abstraits et leur application aux équations intégrales,” Fundamenta Mathematicae, vol. 3, no. 1, pp. 133–181, 1922.

[42] S. R. Livingstone and F. A. Russo, “The ryerson audio-visual database of emotional speech and song (ravdess): A dynamic, multimodal set of facial and vocal expressions in north american english,” PLoS ONE, vol. 13, no. 5, p. e0196391, 2018.

[43] H. Zhu, W. Wu, W. Zhu, L. Jiang, S. Tang, L. Zhang, Z. Liu, and C. C. Loy, “Celebv-hq: A large-scale video facial attributes dataset,” in European Conference on Computer Vision, 2022, pp. 650–667.

[44] X. Cao, G. Wang, S. Shi, J. Zhao, Y. Yao, J. Fei, and M. Gao, “Joyvasa: portrait and animal image animation with diffusion-based audio-driven facial dynamics and head motion generation,” arXiv preprint arXiv:2411.09209, 2024.

[45] S. Jayasumana, S. Ramalingam, A. Veit, D. Glasner, A. Chakrabarti, and S. Kumar, “Rethinking fid: Towards a better evaluation metric for image generation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 9307–9315.

[46] T. Unterthiner, S. Van Steenkiste, K. Kurach, R. Marinier, M. Michalski, and S. Gelly, “Towards accurate generative models of video: A new metric & challenges,” arXiv preprint arXiv:1812.01717, 2018.

[47] J. Deng, J. Guo, N. Xue, and S. Zafeiriou, “Arcface: Additive angular margin loss for deep face recognition,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 4690–4699.

[48] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang, “The unreasonable effectiveness of deep features as a perceptual metric,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018, pp. 586–595.

[49] Q. Gan, R. Yang, J. Zhu, S. Xue, and S. Hoi, “Omniavatar: Efficient audio-driven avatar video generation with adaptive body animation,” arXiv preprint arXiv:2506.18866, 2025.

Zhicheng Zhang is a Ph.D. student at the University of New South Wales (UNSW), Australia, supervised by Dr. Yu Zhang (2024-present). He received his M.S. from The University of Queens...