arxiv: 2605.25488 · v1 · pith:43TEZSGCnew · submitted 2026-05-25 · 💻 cs.CV · cs.AI · cs.MM

Test-Time Self-Adaptive Conditioning for Stable Audio-Driven Talking-Head Generation

Pith reviewed 2026-06-29 22:34 UTC · model grok-4.3

classification 💻 cs.CV cs.AI cs.MM
keywords talking-head generation · test-time adaptation · audio-driven animation · identity preservation · temporal consistency · generative stability · feedback conditioning

The pith

A single test-time adaptation step using the generator's own outputs refines the conditioning reference and stabilizes identity and motion in audio-driven talking-head videos.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Existing audio-driven talking-head models condition an entire video on one fixed reference portrait. This creates a mismatch with changing facial motion that produces identity drift and temporal inconsistency. TT-SAC composes the generator with its encoder in a feedback loop so that the generator's outputs are re-encoded to produce an updated conditioning signal. One adaptation step approximates the self-consistent equilibrium of this loop. Theoretical analysis shows the step reduces feature variance under mild Lipschitz conditions and exposes a bias-variance tradeoff. Experiments on multiple pretrained generators and benchmarks report gains in lip-sync accuracy, temporal coherence, identity preservation, and perceptual quality without any retraining or extra supervision.

Core claim

By feeding the generator's outputs back through its encoder, a single adaptation step constructs a refined conditioning representation that approximates the fixed point of the generator-encoder composition. This fixed point aligns more closely with the temporal dynamics of the synthesized sequence, reducing variance in identity and motion features while improving generative stability under mild Lipschitz assumptions on the composition.
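
As a rough illustration of the feedback step described above, the following Python sketch applies one adaptation step to a toy system. Everything in it is an assumption made for illustration: `toy_generator`, `toy_encoder`, `adapt_once`, the mean pooling over frames, and the `strength` blend weight are hypothetical stand-ins, not the paper's networks or its update rule, which the abstract does not specify.

```python
from statistics import fmean


def toy_generator(cond, audio):
    """Hypothetical generator: one 2-D frame feature per audio sample."""
    return [[0.5 * cond[0] + 1.0 + a, 0.5 * cond[1] - 1.0 - a] for a in audio]


def toy_encoder(frame):
    """Hypothetical encoder: identity map on the toy frame feature."""
    return list(frame)


def adapt_once(cond, audio, strength=0.5):
    """One feedback step: generate, re-encode, pool, blend into the conditioning."""
    feats = [toy_encoder(f) for f in toy_generator(cond, audio)]
    pooled = [fmean(col) for col in zip(*feats)]  # mean pooling is an assumption
    return [(1 - strength) * c + strength * p for c, p in zip(cond, pooled)]


audio = [0.25, -0.25, 0.5, -0.5]  # zero-mean toy driving signal
c0 = [0.0, 0.0]                   # toy "static reference" conditioning
c1 = adapt_once(c0, audio)        # refined conditioning after one step
```

In this toy system the pooled generator-encoder composition maps c to 0.5c + [1, -1], whose fixed point is [2, -2]. One step with strength 0.5 moves the conditioning from [0, 0] to [0.5, -0.5], which is closer to that fixed point but does not reach it.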

What carries the argument

The feedback loop that composes the pretrained generator with its encoder to derive a sequence-aligned conditioning signal from the generator's own outputs in one step.

If this is right

  • Identity preservation and temporal coherence improve on existing benchmark datasets for multiple pretrained generators.
  • Lip-sync accuracy and perceptual fidelity increase without retraining, gradient updates, or additional data.
  • The method applies model-agnostically to any pretrained talking-head generator.
  • A principled bias-variance tradeoff controls the strength of the adaptation.
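
One common way such a tradeoff arises is a shrinkage estimate, sketched below as an illustration and not as the paper's derivation. The symbols are assumptions introduced here: $c_0$ is the static reference embedding, $c^\star$ the ideal conditioning, $\hat{c}$ a re-encoded estimate assumed unbiased for $c^\star$ with total variance $\sigma^2$, and $\lambda \in [0,1]$ the adaptation strength.

```latex
% Assumptions: E[\hat{c}] = c^\star, E\|\hat{c} - c^\star\|^2 = \sigma^2, c_0 deterministic.
\begin{align}
c_\lambda &= (1-\lambda)\,c_0 + \lambda\,\hat{c}, \\
\mathbb{E}\,\|c_\lambda - c^\star\|^2 &= (1-\lambda)^2\,\|c_0 - c^\star\|^2 + \lambda^2\sigma^2, \\
\lambda^\star &= \frac{\|c_0 - c^\star\|^2}{\|c_0 - c^\star\|^2 + \sigma^2}.
\end{align}
```

Under these assumptions the error-minimizing strength lies strictly between 0 and 1 whenever both the reference mismatch and the estimate noise are nonzero.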

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same feedback construction could be tested on other conditional video tasks where a static reference mismatches evolving outputs.
  • Extending the one-step update to a few iterations might further reduce residual drift if inference budget permits.
  • Evaluating the approach on sequences much longer than current benchmarks would test whether the approximated equilibrium persists over extended time.
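
The multi-step extension can be pictured with a toy contraction. The sketch below is a hypothetical scalar example (`toy_step` and its constants are invented for illustration and say nothing about a real generator); it only shows how the gap to a fixed point shrinks when a contractive update is repeated.

```python
def iterate_map(step, c0, n_steps):
    """Apply `step` repeatedly and return the whole trajectory, c0 included."""
    trajectory = [c0]
    for _ in range(n_steps):
        trajectory.append(step(trajectory[-1]))
    return trajectory


def toy_step(c):
    """Hypothetical contraction with Lipschitz constant 0.5 and fixed point 2.0."""
    return 0.5 * c + 1.0


FIXED_POINT = 2.0
residuals = [abs(c - FIXED_POINT) for c in iterate_map(toy_step, 0.0, 3)]
# residuals == [2.0, 1.0, 0.5, 0.25]: each extra step halves the remaining gap
```

Whether a real generator-encoder composition behaves this way depends on it actually being contractive, which is exactly what the referee report questions.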

Load-bearing premise

The generator-encoder composition forms a feedback loop whose fixed point can be approximated in one step, and this fixed point aligns better with the true temporal dynamics of the target sequence than the static reference does.

What would settle it

If applying the single adaptation step increases measured identity drift or temporal inconsistency on standard benchmark videos relative to the static-reference baseline, the stability claim is falsified.
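
A minimal sketch of how this check could be scored, assuming per-frame identity embeddings are already available (for example from a face-recognition model such as ArcFace [47]; which embedding the paper uses is not stated in the abstract). The function names and the toy 2-D embeddings are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def identity_drift(ref, frames):
    """Mean of (1 - cosine similarity) between the reference and each frame."""
    return sum(1.0 - cosine(ref, f) for f in frames) / len(frames)


def stability_claim_survives(ref, baseline_frames, adapted_frames):
    """True if adaptation did not increase drift relative to the static baseline."""
    return identity_drift(ref, adapted_frames) <= identity_drift(ref, baseline_frames)


# Toy embeddings, invented for illustration only.
ref = [1.0, 0.0]
baseline_frames = [[1.0, 0.0], [3.0, 4.0]]
adapted_frames = [[1.0, 0.0], [4.0, 3.0]]
verdict = stability_claim_survives(ref, baseline_frames, adapted_frames)
```

A real evaluation would aggregate this over many benchmark videos and pair it with a temporal-consistency measure; the single comparison here only fixes the direction of the test.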

Figures

Figures reproduced from arXiv: 2605.25488 by Lei Wang, Yongsheng Gao, Yu Zhang, Zhicheng Zhang.

Figure 1: Common failure cases in existing audio-driven talking
Figure 2: Test-Time Self-Adaptive Conditioning (TT-SAC). Conventional talking-head generators use a single identity embedding
Figure 3: Qualitative comparison of typical failure cases. Each block compares the real video against the baseline, baseline+TT
Figure 4: Effect of the number of aggregated frames
Figure 5: Generalization to other audio-driven video synthesis tasks in OmniAvatar [49]. Our method improves motion
Original abstract

Audio-driven talking-head generation has achieved remarkable progress with recent models such as AniTalker, FLOAT, and Sonic. Despite their success, most existing approaches rely on a single static reference image to condition the entire video generation process at inference stage. This static conditioning paradigm often creates a mismatch between fixed identity features and dynamically evolving facial motion, leading to identity drift, temporal inconsistency, and degraded perceptual quality. We introduce Test-Time Self-Adaptive Conditioning (TT-SAC), a parameter-free inference framework that enables pretrained talking-head generators to adapt their conditioning representations during inference without retraining, gradient updates, or additional supervision. Instead of treating the reference portrait as immutable, TT-SAC composes the generator with its encoder in a feedback loop: the generator's own outputs are re-encoded to construct a refined conditioning representation that better aligns with the temporal dynamics of the synthesized sequence. A single adaptation step approximates a self-consistent equilibrium of the generative process, stabilizing identity and motion across time. We further provide theoretical analysis showing that test-time conditioning adaptation reduces feature variance and improves generative stability under mild Lipschitz assumptions, while exhibiting a principled bias-variance tradeoff that governs the optimal strength of adaptation. Extensive experiments on state-of-the-art talking-head generators and benchmark datasets demonstrate consistent improvements in lip-sync accuracy, temporal coherence, identity preservation, and perceptual fidelity. TT-SAC offers a model-agnostic and training-free strategy for enhancing generative video models, establishing test-time conditioning adaptation as an effective mechanism for stabilizing audio-driven portrait animation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper proposes Test-Time Self-Adaptive Conditioning (TT-SAC), a parameter-free, training-free inference framework for audio-driven talking-head generation. It composes a pretrained generator with its encoder in a feedback loop so that the generator's outputs are re-encoded to produce a refined conditioning representation; a single adaptation step is claimed to approximate a self-consistent equilibrium that stabilizes identity and motion. The manuscript supplies a theoretical argument that this reduces feature variance under mild Lipschitz assumptions while exhibiting a bias-variance tradeoff, and reports empirical gains in lip-sync accuracy, temporal coherence, identity preservation, and perceptual quality across state-of-the-art generators and benchmark datasets.

Significance. If the one-step fixed-point claim can be supported by an explicit contraction analysis and the empirical results prove reproducible with proper controls, TT-SAC would constitute a general, model-agnostic technique for improving temporal stability in conditional video generators without retraining or additional data.

major comments (3)
  1. [Abstract] Abstract (theoretical analysis paragraph): the claim that 'a single adaptation step approximates a self-consistent equilibrium' under 'mild Lipschitz assumptions' is not accompanied by any contraction-mapping bound, residual analysis, or condition on the Lipschitz constant L that would guarantee sufficient accuracy after one iteration; without such a bound the observed stabilization may reduce to simple feature averaging rather than equilibrium properties.
  2. [Abstract] Abstract: the method is repeatedly described as 'parameter-free,' yet the same paragraph states that adaptation strength is governed by a 'principled bias-variance tradeoff' that determines its 'optimal strength'; this internal tension must be resolved by showing either that the strength is derived without any free parameter or that it is fixed by a universal rule independent of the test sequence.
  3. [Abstract] Abstract (experiments paragraph): no error bars, statistical tests, dataset cardinalities, or verification that adaptation strength was not selected post-hoc on the evaluation sets are supplied, so the reported 'consistent improvements' cannot be assessed for reliability or generality.
minor comments (1)
  1. [Abstract] Abstract: the models AniTalker, FLOAT, and Sonic are named but the manuscript should confirm that full bibliographic references appear in the reference list.
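
For reference, the kind of bound major comment 1 asks for follows from the Banach fixed-point theorem [41]. The derivation below is a standard textbook argument, not taken from the paper. It assumes the composed map $T$ (generator followed by encoder, a symbol introduced here) is a contraction with constant $L < 1$ on a complete normed space, which is stronger than plain Lipschitz continuity; $c_0$ is the initial conditioning and $c_1 = T(c_0)$ the one-step update.

```latex
% Assumption: \|T(c) - T(c')\| \le L\,\|c - c'\| with 0 \le L < 1,
% so T has a unique fixed point c^\star = T(c^\star).
\begin{align}
\|c_1 - c^\star\| &= \|T(c_0) - T(c^\star)\| \le L\,\|c_0 - c^\star\|, \\
\|c_0 - c^\star\| &\le \|c_0 - c_1\| + \|c_1 - c^\star\|
                   \le \|c_0 - c_1\| + L\,\|c_0 - c^\star\|, \\
\|c_1 - c^\star\| &\le \frac{L}{1 - L}\,\|c_1 - c_0\|.
\end{align}
```

The last line is computable at test time because it involves only the observed update $c_1 - c_0$. For $L \ge 1$ no one-step guarantee follows from this argument.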

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment point by point below, offering clarifications and committing to revisions that strengthen the presentation without altering the core claims.

  1. Referee: [Abstract] Abstract (theoretical analysis paragraph): the claim that 'a single adaptation step approximates a self-consistent equilibrium' under 'mild Lipschitz assumptions' is not accompanied by any contraction-mapping bound, residual analysis, or condition on the Lipschitz constant L that would guarantee sufficient accuracy after one iteration; without such a bound the observed stabilization may reduce to simple feature averaging rather than equilibrium properties.

    Authors: We agree that an explicit contraction-mapping bound and residual analysis would strengthen the one-step approximation claim. Our current theoretical argument establishes variance reduction under Lipschitz continuity but does not supply a quantitative bound on the residual after a single iteration. In the revised manuscript we will add this analysis, deriving a bound on the approximation error conditioned on L. revision: yes

  2. Referee: [Abstract] Abstract: the method is repeatedly described as 'parameter-free,' yet the same paragraph states that adaptation strength is governed by a 'principled bias-variance tradeoff' that determines its 'optimal strength'; this internal tension must be resolved by showing either that the strength is derived without any free parameter or that it is fixed by a universal rule independent of the test sequence.

    Authors: The designation 'parameter-free' denotes the absence of any learned parameters, retraining, or test-time optimization. The bias-variance tradeoff is used only to derive a single fixed strength value that is applied uniformly to every test sequence; this value is independent of individual data and chosen once from the theoretical analysis. We will revise the abstract to state this explicitly and remove any implication of per-sequence optimization. revision: yes

  3. Referee: [Abstract] Abstract (experiments paragraph): no error bars, statistical tests, dataset cardinalities, or verification that adaptation strength was not selected post-hoc on the evaluation sets are supplied, so the reported 'consistent improvements' cannot be assessed for reliability or generality.

    Authors: We concur that these reporting elements are necessary. The revised version will add error bars from multiple random seeds, paired statistical tests, explicit dataset cardinalities, and a statement confirming that the adaptation strength was fixed a priori from the theoretical analysis and never tuned on the evaluation sets. revision: yes
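
One simple form a paired statistical test could take is an exact sign-flip permutation test on per-video score differences. The sketch below is an illustration under that assumption; the function name and the toy scores are hypothetical and are not results from the paper.

```python
from itertools import product
from statistics import fmean


def paired_sign_flip_pvalue(baseline, adapted):
    """Exact two-sided sign-flip permutation test on paired differences."""
    diffs = [a - b for a, b in zip(adapted, baseline)]
    observed = abs(fmean(diffs))
    hits = 0
    total = 0
    # Under the null hypothesis each paired difference is equally likely
    # to carry either sign, so enumerate every sign assignment.
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = abs(fmean([s * d for s, d in zip(signs, diffs)]))
        if flipped >= observed - 1e-12:
            hits += 1
    return hits / total


# Toy per-video scores, invented for illustration only.
baseline_scores = [1, 2, 3, 4, 5, 6]
adapted_scores = [2, 4, 4, 6, 6, 8]
p_value = paired_sign_flip_pvalue(baseline_scores, adapted_scores)
```

Exact enumeration costs 2^n evaluations, so for benchmark-sized sets one would sample sign assignments at random instead. With six pairs that all improve, the smallest attainable two-sided p-value is 2/64.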

Circularity Check

0 steps flagged

No circularity; adaptation and equilibrium approximation defined independently of claimed stability gains

full rationale

The paper explicitly defines TT-SAC as a parameter-free composition of generator and encoder into a one-step feedback loop that approximates equilibrium, then separately claims (under mild Lipschitz assumptions) that this reduces feature variance and exhibits a bias-variance tradeoff. No equations, self-citations, or fitted parameters are shown that would make the reported stabilization or alignment with temporal dynamics equivalent to the input definitions by construction. The derivation chain is self-contained; the one-step approximation is presented as an empirical mechanism rather than a tautological renaming or load-bearing self-reference.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

Review performed on abstract only; free parameters and axioms are inferred from stated claims rather than explicit derivations.

free parameters (1)
  • adaptation strength
    Governs the bias-variance tradeoff in the theoretical analysis; value not supplied in abstract.
axioms (1)
  • domain assumption: mild Lipschitz assumptions on the generator-encoder composition
    Invoked to guarantee variance reduction and existence of equilibrium after one adaptation step.

pith-pipeline@v0.9.1-grok · 5808 in / 1269 out tokens · 31488 ms · 2026-06-29T22:34:32.369724+00:00 · methodology


Reference graph

Works this paper leans on

49 extracted references · 17 canonical work pages · 6 internal anchors

  1. [1]

    First order motion model for image animation,

    A. Siarohin, S. Lathuilière, S. Tulyakov, E. Ricci, and N. Sebe, “First order motion model for image animation,” in Advances in Neural Information Processing Systems, 2019, pp. 7135–7145

  2. [2]

    Hierarchical cross-modal talking face generation with dynamic pixel-wise loss,

    L. Chen, R. K. Maddox, Z. Duan, and C. Xu, “Hierarchical cross-modal talking face generation with dynamic pixel-wise loss,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 7832–7841

  3. [3]

    A lip sync expert is all you need for speech to lip generation in the wild,

    K. Prajwal, R. Mukhopadhyay, V. P. Namboodiri, and C. Jawahar, “A lip sync expert is all you need for speech to lip generation in the wild,” in Proceedings of the 28th ACM International Conference on Multimedia, 2020, pp. 484–492

  4. [4]

    Hallo: Hierarchical audio-driven visual synthesis for portrait image animation,

    M. Xu, H. Li, Q. Su, H. Shang, L. Zhang, C. Liu, J. Wang, Y. Yao, and S. Zhu, “Hallo: Hierarchical audio-driven visual synthesis for portrait image animation,” arXiv preprint arXiv:2406.08801, 2024

  5. [5]

    Hallo2: Long-duration and high-resolution audio-driven portrait image animation,

    J. Cui, H. Li, Y. Yao, H. Zhu, H. Shang, K. Cheng, H. Zhou, S. Zhu, and J. Wang, “Hallo2: Long-duration and high-resolution audio-driven portrait image animation,” arXiv preprint arXiv:2410.07718, 2024

  6. [6]

    Hallo3: Highly dynamic and realistic portrait image animation with video diffusion transformer,

    J. Cui, H. Li, Y. Zhan, H. Shang, K. Cheng, Y. Ma, S. Mu, H. Zhou, J. Wang, and S. Zhu, “Hallo3: Highly dynamic and realistic portrait image animation with video diffusion transformer,” in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 21086–21095

  7. [7]

    Hallo4: High-fidelity dynamic portrait animation via direct preference optimization and temporal motion modulation,

    J. Cui, Y. Chen, M. Xu, H. Shang, Y. Chen, Y. Zhan, Z. Dong, Y. Yao, J. Wang, and S. Zhu, “Hallo4: High-fidelity dynamic portrait animation via direct preference optimization and temporal motion modulation,” arXiv preprint arXiv:2505.23525, 2025

  8. [8]

    Dawn: Dynamic frame avatar with non-autoregressive diffusion framework for talking head video generation,

    H. Cheng, L. Lin, C. Liu, P. Xia, P. Hu, J. Ma, J. Du, and J. Pan, “Dawn: Dynamic frame avatar with non-autoregressive diffusion framework for talking head video generation,” arXiv preprint arXiv:2410.13726, 2024

  9. [9]

    Aniportrait: Audio-driven synthesis of photorealistic portrait animation,

    H. Wei, Z. Yang, and Z. Wang, “Aniportrait: Audio-driven synthesis of photorealistic portrait animation,” arXiv preprint arXiv:2403.17694, 2024

  10. [10]

    Liveportrait: Efficient portrait animation with stitching and retargeting control,

    J. Guo, D. Zhang, X. Liu, Z. Zhong, Y. Zhang, P. Wan, and D. Zhang, “Liveportrait: Efficient portrait animation with stitching and retargeting control,” arXiv preprint arXiv:2407.03168, 2024

  11. [11]

    Vasa-1: Lifelike audio-driven talking faces generated in real time,

    S. Xu, G. Chen, Y.-X. Guo, J. Yang, C. Li, Z. Zang, Y. Zhang, X. Tong, and B. Guo, “Vasa-1: Lifelike audio-driven talking faces generated in real time,” in Advances in Neural Information Processing Systems, 2024, pp. 660–684

  12. [12]

    Talking-head generation in practice,

    Z. Zhang, L. Wang, Y. Gao, and Y. Zhang, “Talking-head generation in practice,” in The Second International Workshop on Transformative Insights in Multifaceted Evaluation at The Web Conference 2026, 2026. [Online]. Available: https://openreview.net/forum?id=ns3TgZYQTZ

  13. [13]

    Sadtalker: Learning realistic 3d motion coefficients for stylized audio-driven single image talking face animation,

    W. Zhang, X. Cun, X. Wang, Y. Zhang, X. Shen, Y. Guo, Y. Shan, and F. Wang, “Sadtalker: Learning realistic 3d motion coefficients for stylized audio-driven single image talking face animation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 8652–8661

  14. [14]

    Anitalker: animate vivid and diverse talking faces through identity-decoupled facial motion encoding,

    T. Liu, F. Chen, S. Fan, C. Du, Q. Chen, X. Chen, and K. Yu, “Anitalker: animate vivid and diverse talking faces through identity-decoupled facial motion encoding,” in Proceedings of the 32nd ACM International Conference on Multimedia, 2024, pp. 6696–6705

  15. [15]

    Float: Generative motion latent flow matching for audio-driven talking portrait,

    T. Ki, D. Min, and G. Chae, “Float: Generative motion latent flow matching for audio-driven talking portrait,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 14699–14710

  16. [16]

    Sonic: Shifting focus to global audio perception in portrait animation,

    X. Ji, X. Hu, Z. Xu, J. Zhu, C. Lin, Q. He, J. Zhang, D. Luo, Y. Chen, Q. Lin et al., “Sonic: Shifting focus to global audio perception in portrait animation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025, pp. 193–203

  17. [17]

    ObamaNet: Photo-realistic lip-sync from text

    R. Kumar, J. Sotelo, K. Kumar, A. De Brebisson, and Y. Bengio, “Obamanet: Photo-realistic lip-sync from text,” arXiv preprint arXiv:1801.01442, 2017

  18. [18]

    Deep video portraits,

    H. Kim, P. Garrido, A. Tewari, W. Xu, J. Thies, M. Niessner, P. Pérez, C. Richardt, M. Zollhöfer, and C. Theobalt, “Deep video portraits,” ACM transactions on graphics (TOG), vol. 37, no. 4, pp. 1–14, 2018

  19. [19]

    Synthesizing obama: learning lip sync from audio,

    S. Suwajanakorn, S. M. Seitz, and I. Kemelmacher-Shlizerman, “Synthesizing obama: learning lip sync from audio,” ACM Transactions on Graphics (ToG), vol. 36, no. 4, pp. 1–13, 2017

  20. [20]

    Audio-driven facial animation by joint end-to-end learning of pose and emotion,

    T. Karras, T. Aila, S. Laine, A. Herva, and J. Lehtinen, “Audio-driven facial animation by joint end-to-end learning of pose and emotion,” ACM Transactions on Graphics (ToG), vol. 36, no. 4, pp. 1–12, 2017

  21. [21]

    Out of time: automated lip sync in the wild,

    J. S. Chung and A. Zisserman, “Out of time: automated lip sync in the wild,” in Asian Conference on Computer Vision, 2016, pp. 251–263

  22. [22]

    Speech-Driven Facial Reenactment Using Conditional Generative Adversarial Networks

    S. A. Jalalifar, H. Hasani, and H. Aghajan, “Speech-driven facial reenactment using conditional generative adversarial networks,” arXiv preprint arXiv:1803.07461, 2018

  23. [23]

    Audio-driven talking face video generation with learning-based personalized head pose,

    R. Yi, Z. Ye, J. Zhang, H. Bao, and Y.-J. Liu, “Audio-driven talking face video generation with learning-based personalized head pose,” arXiv preprint arXiv:2002.10137, 2020

  24. [24]

    Face2face: Real-time face capture and reenactment of rgb videos,

    J. Thies, M. Zollhofer, M. Stamminger, C. Theobalt, and M. Nießner, “Face2face: Real-time face capture and reenactment of rgb videos,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2016, pp. 2387–2395

  25. [25]

    One-shot free-view neural talking-head synthesis for video conferencing,

    T.-C. Wang, A. Mallya, and M.-Y. Liu, “One-shot free-view neural talking-head synthesis for video conferencing,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 10039–10049

  26. [26]

    Dreamtalk: When emotional talking head generation meets diffusion probabilistic models,

    Y. Ma, S. Zhang, J. Wang, X. Wang, Y. Zhang, and Z. Deng, “Dreamtalk: When emotional talking head generation meets diffusion probabilistic models,” arXiv preprint arXiv:2312.09767, 2023

  27. [27]

    Fantasytalking: Realistic talking portrait generation via coherent motion synthesis,

    M. Wang, Q. Wang, F. Jiang, Y. Fan, Y. Zhang, Y. Qi, K. Zhao, and M. Xu, “Fantasytalking: Realistic talking portrait generation via coherent motion synthesis,” in Proceedings of the 33rd ACM International Conference on Multimedia, 2025, pp. 9891–9900

  28. [28]

    Let them talk: Audio-driven multi-person conversational video generation,

    Z. Kong, F. Gao, Y. Zhang, Z. Kang, X. Wei, X. Cai, G. Chen, and W. Luo, “Let them talk: Audio-driven multi-person conversational video generation,” arXiv preprint arXiv:2505.22647, 2025

  29. [29]

    Playmate: Flexible control of portrait animation via 3d-implicit space guided diffusion,

    X. Ma, J. Cai, Y. Guan, S. Huang, Q. Zhang, and S. Zhang, “Playmate: Flexible control of portrait animation via 3d-implicit space guided diffusion,” arXiv preprint arXiv:2502.07203, 2025

  30. [30]

    Echomimic: Lifelike audio-driven portrait animations through editable landmark conditions,

    Z. Chen, J. Cao, Z. Chen, Y. Li, and C. Ma, “Echomimic: Lifelike audio-driven portrait animations through editable landmark conditions,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 39, no. 3, 2025, pp. 2403–2410

  31. [31]

    Megaportraits: One-shot megapixel neural head avatars,

    N. Drobyshev, J. Chelishev, T. Khakhulin, A. Ivakhnenko, V. Lempitsky, and E. Zakharov, “Megaportraits: One-shot megapixel neural head avatars,” in Proceedings of the 30th ACM International Conference on Multimedia, 2022, pp. 2663–2671

  32. [32]

    Disentangle identity, cooperate emotion: Correlation-aware emotional talking portrait generation,

    W. Tan, C. Lin, C. Xu, F. Xu, X. Hu, X. Ji, J. Zhu, C. Wang, and Y. Fu, “Disentangle identity, cooperate emotion: Correlation-aware emotional talking portrait generation,” in Proceedings of the 33rd ACM International Conference on Multimedia, 2025, pp. 9987–9995

  33. [33]

    Learnable expansion of graph operators for multi-modal feature fusion,

    D. Ding, L. Wang, L. Zhu, T. Gedeon, and P. Koniusz, “Learnable expansion of graph operators for multi-modal feature fusion,” in The Thirteenth International Conference on Learning Representations, 2025. [Online]. Available: https://openreview.net/forum?id=SMZqIOSdlN

  34. [34]

    Multimodal fusion for talking face generation utilizing speech-related facial action units,

    Z. Liu, X. Liu, S. Chen, J. Liu, L. Wang, and C. Bi, “Multimodal fusion for talking face generation utilizing speech-related facial action units,” ACM Transactions on Multimedia Computing, Communications and Applications, vol. 20, no. 9, pp. 1–24, 2024

  35. [35]

    Stylesync: High-fidelity generalized and personalized lip sync in style-based generator,

    J. Guan, Z. Zhang, H. Zhou, T. Hu, K. Wang, D. He, H. Feng, J. Liu, E. Ding, Z. Liu et al., “Stylesync: High-fidelity generalized and personalized lip sync in style-based generator,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 1505–1515

  36. [36]

    Denoising diffusion probabilistic models,

    J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,” in Advances in Neural Information Processing Systems, 2020, pp. 6840–6851

  37. [37]

    Denoising Diffusion Implicit Models

    J. Song, C. Meng, and S. Ermon, “Denoising diffusion implicit models,” arXiv preprint arXiv:2010.02502, 2020

  38. [38]

    Flow Matching for Generative Modeling

    Y. Lipman, R. T. Chen, H. Ben-Hamu, M. Nickel, and M. Le, “Flow matching for generative modeling,” arXiv preprint arXiv:2210.02747, 2022

  39. [39]

    Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps,

    C. Lu, Y. Zhou, F. Bao, J. Chen, C. Li, and J. Zhu, “Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps,” in Advances in Neural Information Processing Systems, 2022, pp. 5775–5787

  40. [40]

    The monte carlo method,

    N. Metropolis and S. Ulam, “The monte carlo method,” Journal of the American statistical association, vol. 44, no. 247, pp. 335–341, 1949

  41. [41]

    Sur les opérations dans les ensembles abstraits et leur application aux équations intégrales,

    S. Banach, “Sur les opérations dans les ensembles abstraits et leur application aux équations intégrales,” Fundamenta mathematicae, vol. 3, no. 1, pp. 133–181, 1922

  42. [42]

    The ryerson audio-visual database of emotional speech and song (ravdess): A dynamic, multimodal set of facial and vocal expressions in north american english,

    S. R. Livingstone and F. A. Russo, “The ryerson audio-visual database of emotional speech and song (ravdess): A dynamic, multimodal set of facial and vocal expressions in north american english,” PloS one, vol. 13, no. 5, p. e0196391, 2018

  43. [43]

    Celebv-hq: A large-scale video facial attributes dataset,

    H. Zhu, W. Wu, W. Zhu, L. Jiang, S. Tang, L. Zhang, Z. Liu, and C. C. Loy, “Celebv-hq: A large-scale video facial attributes dataset,” in European Conference on Computer Vision, 2022, pp. 650–667

  44. [44]

    JoyVASA: Portrait and Animal Image Animation with Diffusion-Based Audio-Driven Facial Dynamics and Head Motion Generation

    X. Cao, G. Wang, S. Shi, J. Zhao, Y. Yao, J. Fei, and M. Gao, “Joyvasa: portrait and animal image animation with diffusion-based audio-driven facial dynamics and head motion generation,” arXiv preprint arXiv:2411.09209, 2024

  45. [45]

    Rethinking fid: Towards a better evaluation metric for image generation,

    S. Jayasumana, S. Ramalingam, A. Veit, D. Glasner, A. Chakrabarti, and S. Kumar, “Rethinking fid: Towards a better evaluation metric for image generation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 9307–9315

  46. [46]

    Towards Accurate Generative Models of Video: A New Metric & Challenges

    T. Unterthiner, S. Van Steenkiste, K. Kurach, R. Marinier, M. Michalski, and S. Gelly, “Towards accurate generative models of video: A new metric & challenges,” arXiv preprint arXiv:1812.01717, 2018

  47. [47]

    Arcface: Additive angular margin loss for deep face recognition,

    J. Deng, J. Guo, N. Xue, and S. Zafeiriou, “Arcface: Additive angular margin loss for deep face recognition,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 4690–4699

  48. [48]

    The unreasonable effectiveness of deep features as a perceptual metric,

    R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang, “The unreasonable effectiveness of deep features as a perceptual metric,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018, pp. 586–595

  49. [49]

    Omniavatar: Efficient audio-driven avatar video generation with adaptive body animation,

    Q. Gan, R. Yang, J. Zhu, S. Xue, and S. Hoi, “Omniavatar: Efficient audio-driven avatar video generation with adaptive body animation,” arXiv preprint arXiv:2506.18866, 2025

Zhicheng Zhang is a Ph.D. student at the University of New South Wales (UNSW), Australia, supervised by Dr. Yu Zhang (2024-present). He received his M.S. from The University of Queens...