pith. sign in

arxiv: 2606.22905 · v2 · pith:NSJK4GOYnew · submitted 2026-06-22 · 💻 cs.CV

InteractiveAvatar: Real-Time Streaming Video Generation for Consistent and Intent-Aware Avatars

Pith reviewed 2026-07-01 06:57 UTC · model grok-4.3

classification 💻 cs.CV
keywords avatar video generationreal-time streamingvisual consistencyintent-aware interactiondiffusion modelsautoregressive distillationlong-short visual memoryreasoning-reaction module
0
0 comments X

The pith

InteractiveAvatar generates consistent avatar videos in real time while aligning with user intent over arbitrarily long streams.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces InteractiveAvatar as a framework for real-time infinite-streaming avatar video generation that maintains visual temporal consistency and supports intent-aware interactions. It uses autoregressive distillation to enable generation over arbitrarily long durations without the inconsistencies common in prior diffusion-based approaches. A Long-Short Visual Memory mechanism compresses historical visual information into compact tokens to keep both short-range and long-term coherence. A Reasoning-Reaction Module with State-Cycling and Cache-Switching strategies allows the system to perceive user intent and align avatar speech and actions accordingly. A sympathetic reader would care because this setup could support sustained, natural interactions with virtual characters in streaming scenarios where previous methods lose coherence or misread intent.

Core claim

InteractiveAvatar is a real-time infinite-streaming video generation framework that supports visually consistent avatar video generation and intent-aware interactions. With autoregressive distillation, it achieves real-time streaming generation of human avatars over arbitrarily long durations. For visual consistency, it introduces a Long-Short Visual Memory mechanism that flexibly compresses historical visual information into compact tokens, preserving both short-range coherence and long-term consistency. To generate avatars with speeches and actions aligned with user intent, it proposes a Reasoning-Reaction Module that incorporates a State-Cycling strategy and a Cache-Switching mechanism.

What carries the argument

Long-Short Visual Memory (LSVM) that compresses historical visual information into compact tokens to preserve short-range coherence and long-term consistency, paired with Reasoning-Reaction Module (RRM) using State-Cycling and Cache-Switching to align avatar speech and actions to perceived user intent.

If this is right

  • Avatar video generation becomes feasible over arbitrarily long durations while staying visually consistent.
  • Complex user-avatar interactions occur in real time with explicit intent perception.
  • State-of-the-art visual consistency holds across diverse interactive scenarios.
  • Real-time performance is sustained through autoregressive distillation without breaking streaming.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The memory compression approach could apply to other real-time video tasks that require long-term coherence beyond avatars.
  • If intent alignment scales reliably, it might support multi-turn conversations in virtual environments more naturally.
  • The overall design could reduce reliance on pre-generated clips for interactive avatar systems.

Load-bearing premise

The Long-Short Visual Memory and Reasoning-Reaction Module deliver the claimed consistency and intent alignment without introducing new artifacts or latency that would break real-time performance.

What would settle it

A 30-minute interactive streaming test with frequent user intent changes, measuring whether avatar appearance remains consistent without drift and whether responses match intents at real-time latency thresholds.

Figures

Figures reproduced from arXiv: 2606.22905 by Caigui Jiang, Chi Zhang, Quanyue Song, Shihao Cheng, Xuelong Li, Yanfei Zhang, Yishan He, Zhixiang He, Zhizhi Guo.

Figure 1
Figure 1. Figure 1: We propose InteractiveAvatar, a real-time streaming audio-driven avatar gen￾eration framework that enables intent-aware interaction. InteractiveAvatar interprets user intent to generate contextually relevant actions throughout the dialogue while maintaining long-range visual consistency. The RRM enhances the realism of user-avatar interaction by leveraging a large language model for intent understanding an… view at source ↗
Figure 2
Figure 2. Figure 2: Overview of InteractiveAvatar, which consists of (a) The Reasoning-Reaction Module (RRM) performs intent-aware interaction with user; (b) Streaming Inference with Long-Short Visual Memory (LSVM) mechanism to enhance the visual consistency; and (c) DMD training for real-time streaming generation. cues, with synchronized but simple gestures. Recent works [8, 19] on interac￾tive avatars have explored audio-dr… view at source ↗
Figure 3
Figure 3. Figure 3: LSVM Mechanism.(a) During training, long-term memory frames are randomly sampled, while short-term memory retains all recent frames. (b) During inference, Dynamic Key-Frame Selection adaptively updates memory to retain critical visual information. generated frames to ensure local temporal coherence, while the long-term mem￾ory stores compact representations of globally salient visual states to stabilize ov… view at source ↗
Figure 4
Figure 4. Figure 4: Qualitative comparisons with state-of-the-art methods. Our method exhibits better visual consistency and following of action instructions. and aesthetic appeal (ASE). Distribution-level fidelity is measured by FID [11] for frame-wise realism and FVD [29] for overall spatio-temporal coherence. For video consistency, we measure audio-visual synchronization using SynC and SynD [5], capturing the correspondenc… view at source ↗
Figure 4
Figure 4. Figure 4: Qualitative comparisons with state-of-the-art methods. Our method exhibits better visual consistency and following of action instructions. Qualitative visualizations further compare our method with OmniAvatar [10], WanS2V [31] and LiveAvatar [14] in [PITH_FULL_IMAGE:figures/full_fig_p013_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Qualitative ablation of InteractiveAvatar. Ablation studies show that our Full model maintains the best visual consistency and enables more realistic interactions. Selection with random sampling (w/o DKFS) causes slight distortions in the watch face, highlighting the advantage of informed memory updates. Remov￾ing the entire LSVM module (w/o LSVM) leads to a significant drop in OBJ, confirming its importan… view at source ↗
read the original abstract

Recent diffusion-based models have enabled realistic audio-driven avatar generation in real-time streaming. However, existing approaches struggle to maintain visual temporal consistency and fail to explicitly perceive user intent in complex interactive streaming scenarios. To address these challenges, we propose InteractiveAvatar, a real-time infinite-streaming video generation framework that supports visually consistent avatar video generation and intent-aware interactions. With autoregressive distillation, InteractiveAvatar achieves real-time str-eaming generation of human avatars over arbitrarily long durations. For visual consistency, we introduce a Long-Short Visual Memory (LSVM) mechanism that flexibly compresses historical visual information into compact tokens, preserving both short-range coherence and long-term consistency. To generate avatars with speeches and actions aligned with user intent, we propose a Reasoning-Reaction Module (RRM), which incorporates a State-Cycling strategy and a Cache-Switching mechanism. Extensive experimental results over diverse scenarios demonstrate that our method achieves state-of-the-art visual consistency in long-duration generation, while enabling complex user-avatar interaction in real time.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 1 minor

Summary. The paper presents InteractiveAvatar, a real-time infinite-streaming video generation framework for human avatars. It addresses temporal inconsistency and intent misalignment in diffusion-based audio-driven models via autoregressive distillation for long-duration generation, a Long-Short Visual Memory (LSVM) mechanism to compress historical visuals into tokens for short- and long-range coherence, and a Reasoning-Reaction Module (RRM) incorporating State-Cycling and Cache-Switching to align avatar speech and actions with user intent. Experiments over diverse scenarios are claimed to demonstrate state-of-the-art visual consistency and real-time complex interactions.

Significance. If the LSVM and RRM constructions deliver the claimed consistency and alignment without compromising real-time performance or introducing artifacts, the work would advance practical interactive avatar systems for applications such as virtual communication and entertainment by solving key streaming limitations of prior diffusion approaches.

minor comments (1)
  1. [Abstract] Abstract contains a typographical error ('str-eaming' instead of 'streaming').

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their careful summary of InteractiveAvatar and for noting its potential impact on practical interactive avatar systems. The recommendation of 'uncertain' appears to stem from the need for confirmation that LSVM and RRM achieve the stated consistency and alignment without sacrificing real-time performance or introducing artifacts. We address this directly below and clarify that our experiments support these claims.

read point-by-point responses
  1. Referee: If the LSVM and RRM constructions deliver the claimed consistency and alignment without compromising real-time performance or introducing artifacts, the work would advance practical interactive avatar systems for applications such as virtual communication and entertainment by solving key streaming limitations of prior diffusion approaches.

    Authors: Our experiments in Sections 4.2 and 4.3 demonstrate that LSVM preserves both short-range and long-term visual coherence (via quantitative metrics such as temporal consistency scores and user studies) while RRM enables intent-aligned reactions without measurable latency overhead. Real-time performance is maintained at >30 FPS on the reported hardware, and qualitative results across diverse long-duration sequences show no introduced artifacts attributable to the proposed modules. We are happy to add additional ablation tables or latency breakdowns if requested. revision: no

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The abstract and description introduce LSVM and RRM modules plus autoregressive distillation as solutions to temporal consistency and intent alignment, but present no equations, fitted parameters, predictions of derived quantities, or self-citation chains. No derivation steps are described that could reduce to inputs by construction, self-definition, or renaming. The paper's claims are architectural and empirical rather than mathematical reductions, making the derivation self-contained against external benchmarks with no load-bearing circular elements.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only; no free parameters, axioms, or invented entities are specified.

pith-pipeline@v0.9.1-grok · 5734 in / 938 out tokens · 22525 ms · 2026-07-01T06:57:36.765959+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. DART: Difficulty-Adaptive Routing for Zero-Shot Video Temporal Grounding

    cs.CV 2026-07 unverdicted novelty 7.0

    DART routes zero-shot video temporal grounding queries by difficulty using DPP entropy, achieving up to 3.5 mIoU gains with 7x fewer frames on Charades-STA and ActivityNet Captions.

Reference graph

Works this paper leans on

49 extracted references · 26 canonical work pages · cited by 1 Pith paper · 13 internal anchors

  1. [1]

    Ai flow: Perspectives, scenarios, and approaches,

    An, H., Hu, W., Huang, S., Huang, S., Li, R., Liang, Y., Shao, J., Song, Y., Wang, Z., Yuan, C., et al.: Ai flow: Perspectives, scenarios, and approaches (2025). arXiv preprint arXiv:2506.12479 (2025)

  2. [2]

    arXiv preprint arXiv:2505.20156 (2025)

    Chen, Y., Liang, S., Zhou, Z., Huang, Z., Ma, Y., Tang, J., Lin, Q., Zhou, Y., Lu, Q.: Hunyuanvideo-avatar: High-fidelity audio-driven human animation for multiple characters. arXiv preprint arXiv:2505.20156 (2025)

  3. [3]

    In: Proceedings of the AAAI Conference on Artificial Intelligence

    Chen, Z., Cao, J., Chen, Z., Li, Y., Ma, C.: Echomimic: Lifelike audio-driven portrait animations through editable landmark conditions. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 39, pp. 2403–2410 (2025)

  4. [4]

    Cheng, S., Zhang, J., Song, Q., Liu, S., Guo, Z., Zhang, X., Zhang, C., Li, X., Tu, Z.: Unison: Harmonizing motion, speech, and sound for human-centric audio-video generation (2026),https://arxiv.org/abs/2605.08729

  5. [5]

    In: INTERSPEECH (2018)

    Chung, J.S., Nagrani, A., Zisserman, A.: Voxceleb2: Deep speaker recognition. In: INTERSPEECH (2018)

  6. [6]

    In: Asian conference on computer vision

    Chung, J.S., Zisserman, A.: Out of time: automated lip sync in the wild. In: Asian conference on computer vision. pp. 251–263. Springer (2016)

  7. [7]

    In: Proceedings of the Computer Vision and Pattern Recognition Conference

    Cui, J., Li, H., Zhan, Y., Shang, H., Cheng, K., Ma, Y., Mu, S., Zhou, H., Wang, J., Zhu, S.: Hallo3: Highly dynamic and realistic portrait image animation with video diffusion transformer. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 21086–21095 (2025)

  8. [8]

    Self-Forcing++: Towards Minute-Scale High-Quality Video Generation

    Cui, J., Wu, J., Li, M., Yang, T., Li, X., Wang, R., Bai, A., Ban, Y., Hsieh, C.J.: Self-forcing++: Towards minute-scale high-quality video generation. arXiv preprint arXiv:2510.02283 (2025)

  9. [9]

    arXiv preprint arXiv:2505.10238 (2025)

    Ding, Y., Hu, X., Guo, Z., Zhang, C., Wang, Y.: Mtvcrafter: 4d motion tokenization for open-world human image animation. arXiv preprint arXiv:2505.10238 (2025)

  10. [10]

    arXiv preprint arXiv:2506.18866 (2025)

    Gan, Q., Yang, R., Zhu, J., Xue, S., Hoi, S.: Omniavatar: Efficient audio- driven avatar video generation with adaptive body animation. arXiv preprint arXiv:2506.18866 (2025)

  11. [11]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Ginosar, S., Bar, A., Kohavi, G., Chan, C., Owens, A., Malik, J.: Learning individ- ual styles of conversational gesture. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 3497–3506 (2019)

  12. [12]

    Advances in neural information processing systems30(2017)

    Heusel,M.,Ramsauer,H.,Unterthiner,T.,Nessler,B.,Hochreiter,S.:Ganstrained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems30(2017)

  13. [13]

    Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion

    Huang, X., Li, Z., He, G., Zhou, M., Shechtman, E.: Self forcing: Bridging the train-test gap in autoregressive video diffusion. arXiv preprint arXiv:2506.08009 (2025)

  14. [14]

    Live Avatar: Streaming Real-time Audio-Driven Avatar Generation with Infinite Length

    Huang, Y., Guo, H., Wu, F., Zhang, S., Huang, S., Gan, Q., Liu, L., Zhao, S., Chen, E., Liu, J., et al.: Live avatar: Streaming real-time audio-driven avatar generation with infinite length. arXiv preprint arXiv:2512.04677 (2025)

  15. [15]

    Vicinagearth1(1), 8 (2024)

    Jiang, W., Zhang, Y., Zheng, S., Liu, S., Yan, S.: Data augmentation in human- centric vision. Vicinagearth1(1), 8 (2024)

  16. [16]

    arXiv preprint arXiv:2505.22647 (2025)

    Kong, Z., Gao, F., Zhang, Y., Kang, Z., Wei, X., Cai, X., Chen, G., Luo, W.: Let them talk: Audio-driven multi-person conversational video generation. arXiv preprint arXiv:2505.22647 (2025)

  17. [17]

    arXiv preprint arXiv:2412.00115 (2024) InteractiveAvatar 17

    Li, H., Xu, M., Zhan, Y., Mu, S., Li, J., Cheng, K., Chen, Y., Chen, T., Ye, M., Wang, J., et al.: Openhumanvid: A large-scale high-quality dataset for enhancing human-centric video generation. arXiv preprint arXiv:2412.00115 (2024) InteractiveAvatar 17

  18. [18]

    Vicinagearth1(1), 9 (2024)

    Li, X., Wang, S., Zeng, S., Wu, Y., Yang, Y.: A survey on llm-based multi-agent systems: workflow, infrastructure, and challenges. Vicinagearth1(1), 9 (2024)

  19. [19]

    IEEE Transactions on Neural Networks and Learn- ing Systems35(6), 8708–8714 (2022)

    Li, X.: Positive-incentive noise. IEEE Transactions on Neural Networks and Learn- ing Systems35(6), 8708–8714 (2022)

  20. [20]

    Talkingmachines: Real- time audio-driven facetime-style video via autoregressive diffusion models.arXiv preprint arXiv:2506.03099, 2025

    Low, C., Wang, W.: Talkingmachines: Real-time audio-driven facetime-style video via autoregressive diffusion models. arXiv preprint arXiv:2506.03099 (2025)

  21. [21]

    arXiv preprint arXiv:2507.03905 (2025)

    Meng, R., Wang, Y., Wu, W., Zheng, R., Li, Y., Ma, C.: Echomimicv3: 1.3 b pa- rameters are all you need for unified multi-modal and multi-task human animation. arXiv preprint arXiv:2507.03905 (2025)

  22. [22]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Ng, E., Joo, H., Hu, L., Li, H., Darrell, T., Kanazawa, A., Ginosar, S.: Learning to listen: Modeling non-deterministic dyadic facial motion. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 20395– 20405 (2022)

  23. [23]

    In: Proceedings of the 28th ACM international conference on multimedia

    Prajwal, K., Mukhopadhyay, R., Namboodiri, V.P., Jawahar, C.: A lip sync expert is all you need for speech to lip generation in the wild. In: Proceedings of the 28th ACM international conference on multimedia. pp. 484–492 (2020)

  24. [24]

    Journal of machine learning research21(140), 1–67 (2020)

    Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., Liu, P.J.: Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of machine learning research21(140), 1–67 (2020)

  25. [25]

    Siméoni, O., Vo, H.V., Seitzer, M., Baldassarre, F., Oquab, M., Jose, C., Khalidov, V., Szafraniec, M., Yi, S., Ramamonjisoa, M., Massa, F., Haziza, D., Wehrstedt, L., Wang, J., Darcet, T., Moutakanni, T., Sentana, L., Roberts, C., Vedaldi, A., Tolan, J., Brandt, J., Couprie, C., Mairal, J., Jégou, H., Labatut, P., Bojanowski, P.: DINOv3 (2025),https://ar...

  26. [26]

    arXiv preprint arXiv:2512.22065 (2025)

    Sun, Z., Peng, Z., Ma, Y., Chen, Y., Zhou, Z., Zhou, Z., Zhang, G., Zhang, Y., Zhou, Y., Lu, Q., et al.: Streamavatar: Streaming diffusion models for real-time interactive human avatars. arXiv preprint arXiv:2512.22065 (2025)

  27. [27]

    In: European Conference on Computer Vision

    Tian, L., Wang, Q., Zhang, B., Bo, L.: Emo: Emote portrait alive generating ex- pressive portrait videos with audio2video diffusion model under weak conditions. In: European Conference on Computer Vision. pp. 244–260. Springer (2024)

  28. [28]

    SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features

    Tschannen, M., Gritsenko, A., Wang, X., Naeem, M.F., Alabdulmohsin, I., Parthasarathy, N., Evans, T., Beyer, L., Xia, Y., Mustafa, B., et al.: Siglip 2: Multilingual vision-language encoders with improved semantic understanding, lo- calization, and dense features. arXiv preprint arXiv:2502.14786 (2025)

  29. [29]

    Tu, S., Pan, Y., Huang, Y., Han, X., Xing, Z., Dai, Q., Luo, C., Wu, Z., Jiang, Y.G.: Stableavatar: Infinite-length audio-driven avatar video generation (2025), https://arxiv.org/abs/2508.08248

  30. [30]

    Towards Accurate Generative Models of Video: A New Metric & Challenges

    Unterthiner, T., Van Steenkiste, S., Kurach, K., Marinier, R., Michalski, M., Gelly, S.: Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:1812.01717 (2018)

  31. [31]

    Wan: Open and Advanced Large-Scale Video Generative Models

    Wan, T., Wang, A., Ai, B., Wen, B., Mao, C., Xie, C.W., Chen, D., Yu, F., Zhao, H., Yang, J., et al.: Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314 (2025)

  32. [32]

    Wang, J., Wang, C., Huang, K., Huang, J., Jin, L.: Videoclip-xl: Advancing long description understanding for video clip models (2024),https://arxiv.org/abs/ 2410.00741

  33. [33]

    arXiv preprint arXiv:2601.10103 (2026) 18 Q

    Wang, L., Zhu, Y., Ge, Z., Zheng, Y., Zhang, L., Hu, T., Qin, S., Luo, M., Zhang, J., Chen, X., et al.: Flowact-r1: Towards interactive humanoid video generation. arXiv preprint arXiv:2601.10103 (2026) 18 Q. Song et al

  34. [34]

    In: Proceedings of the 33rd ACM International Conference on Multimedia

    Wang, M., Wang, Q., Jiang, F., Fan, Y., Zhang, Y., Qi, Y., Zhao, K., Xu, M.: Fantasytalking: Realistic talking portrait generation via coherent motion synthesis. In: Proceedings of the 33rd ACM International Conference on Multimedia. pp. 9891–9900 (2025)

  35. [35]

    In: Proceedings of the Computer Vision and Pattern Recognition Conference

    Wang, Y., Fan, Y., Wang, X., Yu, G., Wang, F.: Diffusion-based realistic listening head generation via hybrid motion modeling. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 15885–15895 (2025)

  36. [36]

    Q-Align: Teaching LMMs for Visual Scoring via Discrete Text-Defined Levels

    Wu, H., Zhang, Z., Zhang, W., Chen, C., Liao, L., Li, C., Gao, Y., Wang, A., Zhang, E., Sun, W., et al.: Q-align: Teaching lmms for visual scoring via discrete text-defined levels. arXiv preprint arXiv:2312.17090 (2023)

  37. [37]

    In: The IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW) (2022)

    Xie, L., Wang, X., Zhang, H., Dong, C., Shan, Y.: Vfhq: A high-quality dataset and benchmark for video face super-resolution. In: The IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW) (2022)

  38. [38]

    arXiv preprint arXiv:2509.21574 (2025)

    Xie, Y., Gu, T., Li, Z., Zhang, C., Song, G., Zhao, X., Liang, C., Jiang, J., Xu, H., Luo, L.: X-streamer: Unified human world modeling with audiovisual interaction. arXiv preprint arXiv:2509.21574 (2025)

  39. [39]

    Qwen3-Omni Technical Report

    Xu, J., Guo, Z., Hu, H., Chu, Y., Wang, X., He, J., Wang, Y., Shi, X., He, T., Zhu, X., et al.: Qwen3-omni technical report. arXiv preprint arXiv:2509.17765 (2025)

  40. [40]

    Advances in Neural Information Processing Systems37, 660–684 (2024)

    Xu, S., Chen, G., Guo, Y.X., Yang, J., Li, C., Zang, Z., Zhang, Y., Tong, X., Guo, B.: Vasa-1: Lifelike audio-driven talking faces generated in real time. Advances in Neural Information Processing Systems37, 660–684 (2024)

  41. [41]

    LongLive: Real-time Interactive Long Video Generation

    Yang, S., Huang, W., Chu, R., Xiao, Y., Zhao, Y., Wang, X., Li, M., Xie, E., Chen, Y., Lu, Y., et al.: Longlive: Real-time interactive long video generation. arXiv preprint arXiv:2509.22622 (2025)

  42. [42]

    In: Pro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recogni- tion

    Yin, T., Zhang, Q., Zhang, R., Freeman, W.T., Durand, F., Shechtman, E., Huang, X.: From slow bidirectional to fast autoregressive video diffusion models. In: Pro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recogni- tion. pp. 22963–22974 (2025)

  43. [43]

    In: CVPR (2023)

    Yu, J., Zhu, H., Jiang, L., Loy, C.C., Cai, W., Wu, W.: CelebV-Text: A large-scale facial text-video dataset. In: CVPR (2023)

  44. [44]

    IEEE Transactions on Pattern Analysis and Machine Intel- ligence (2025)

    Zhang, H., Huang, S., Guo, Y., Li, X.: Variational positive-incentive noise: How noise benefits models. IEEE Transactions on Pattern Analysis and Machine Intel- ligence (2025)

  45. [45]

    TinyHistory: Lightweight Video History Embeddings via Two-Stage Context Learning

    Zhang, L., Cai, S., Li, M., Zeng, C., Lu, B., Rao, A., Han, S., Wetzstein, G., Agrawala, M.: Pretraining frame preservation in autoregressive video memory com- pression. arXiv preprint arXiv:2512.23851 (2025)

  46. [46]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Zhang, W., Cun, X., Wang, X., Zhang, Y., Shen, X., Guo, Y., Shan, Y., Wang, F.: Sadtalker: Learning realistic 3d motion coefficients for stylized audio-driven single image talking face animation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 8652–8661 (2023)

  47. [47]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Zhang, Z., Li, L., Ding, Y., Fan, C.: Flow-guided one-shot talking face generation with a high-resolution audio-visual dataset. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 3661–3670 (2021)

  48. [48]

    PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel

    Zhao,Y.,Gu,A.,Varma,R.,Luo,L.,Huang,C.C.,Xu,M.,Wright,L.,Shojanazeri, H., Ott, M., Shleifer, S., et al.: Pytorch fsdp: experiences on scaling fully sharded data parallel. arXiv preprint arXiv:2304.11277 (2023)

  49. [49]

    ACM Transactions On Graph- ics (TOG)39(6), 1–15 (2020)

    Zhou, Y., Han, X., Shechtman, E., Echevarria, J., Kalogerakis, E., Li, D.: Makelttalk: speaker-aware talking-head animation. ACM Transactions On Graph- ics (TOG)39(6), 1–15 (2020)