MindFlow: Harmonizing Cognitive Semantics and Acoustic Dynamics for Facial Animation Generation in Dyadic Conversations

Haoxian Zhang; Hejia Chen; Pengfei Wan; Shoulong Zhang; Shuai Li; Xiaoqiang Liu; Xu He

arxiv: 2606.27779 · v1 · pith:Q4ZU4BCAnew · submitted 2026-06-26 · 💻 cs.CV

Hejia Chen, Haoxian Zhang, Xu He, Xiaoqiang Liu, Pengfei Wan, Shoulong Zhang, Shuai Li

Pith reviewed 2026-06-29 05:02 UTC · model grok-4.3

classification 💻 cs.CV

keywords facial animation · dyadic conversation · emotional state chain · chunk-state modeling · flow matching · acoustic dynamics · ventral dorsal pathways

The pith

MindFlow generates more natural facial animations for conversations by modeling audio as an evolving emotional state chain instead of whole sentences.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents MindFlow as a dual-pathway system for creating realistic face movements during two-speaker talks. One pathway processes audio in smaller chunks to track shifting emotions and meaning within an utterance. The other pathway uses those states to drive precise motion generation while handling both talking and listening roles. This replaces earlier sentence-level methods that miss mid-utterance changes. Tests show the resulting animations better match dialogue context and appear more fluid than existing approaches.

Core claim

MindFlow decouples generation into two modules. The Ventral module replaces the Sentence-Action approach with a Chunk-State approach that models raw acoustic streams as a context-aware, evolving emotional state chain. The Dorsal module employs a conditional autoregressive flow matching network for facial motion, driven by high-frequency acoustic cues and modulated by emotion states, together with a Selective Acoustic Injector for adaptive audio gating. The paper claims this combination yields superior semantic appropriateness and motion naturalness over baselines.

What carries the argument

Dual-pathway framework consisting of a Ventral Chunk-State emotional chain from acoustics and a Dorsal conditional autoregressive flow matching network with selective acoustic injection.
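To make the flow matching half of this concrete, the sketch below shows what sampling from such a model generally involves: integrating a velocity field from noise at t = 0 to a sample at t = 1 with fixed Euler steps. It is a minimal illustration, not the paper's network. `toy_velocity` is a closed-form stand-in for the learned conditional velocity model, and `euler_sample`, the `target` argument, and the step count are all assumptions introduced here. In a conditional autoregressive setting the conditioning arguments would instead carry previous motion frames, audio features, and the emotion state.

```python
import random

def toy_velocity(x, t, target):
    """Stand-in for a learned velocity network (assumption, not the paper's model).

    For the straight-line path x_t = (1 - t) * x0 + t * target, the
    conditional velocity at x_t is (target - x_t) / (1 - t).
    """
    return [(g - xi) / (1.0 - t) for xi, g in zip(x, target)]

def euler_sample(velocity, x0, num_steps, **cond):
    """Integrate dx/dt = velocity(x, t, **cond) from t = 0 to t = 1."""
    x = list(x0)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt  # the largest t evaluated is 1 - dt, so (1 - t) is never zero
        v = velocity(x, t, **cond)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(4)]   # starting point drawn from N(0, 1)
target = [0.1, -0.3, 0.5, 0.0]                    # hypothetical motion coefficients
sample = euler_sample(toy_velocity, noise, num_steps=8, target=target)
```

With this toy field the path is a straight line, so the result lands on `target` for any step count. A learned field is generally curved, which is why the number of sampling steps trades synthesis quality against latency.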

If this is right

Facial animations can reflect emotional transitions that occur inside a single spoken sentence.
The system stays robust across speaking and listening roles, with adaptive audio gating limiting interference between the two.
High-frequency sound details can be injected selectively to refine motion without disrupting overall coherence.
The approach produces measurable gains in both meaning alignment and physical plausibility over prior sentence-action baselines.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The chunk-state idea could be adapted to generate body gestures or head movements that stay consistent with spoken emotion flow.
Longer multi-turn conversations might expose whether the emotional state chain needs explicit memory mechanisms to avoid drift.
Pairing the acoustic state chain with text-based dialogue planners could tighten semantic control further.

Load-bearing premise

The assumption that chunk-based modeling of raw acoustic streams as an evolving emotional state chain will capture subtle paralinguistic nuances and mid-utterance emotional shifts that sentence-level approaches miss.

What would settle it

A head-to-head evaluation on a test set of dyadic conversations containing rapid within-utterance emotional changes, checking whether MindFlow animations receive lower human ratings for naturalness or semantic fit than the strongest baseline.

Figures

Figures reproduced from arXiv: 2606.27779 by Haoxian Zhang, Hejia Chen, Pengfei Wan, Shoulong Zhang, Shuai Li, Xiaoqiang Liu, Xu He.

**Figure 1.** Grounded in the Ventral-Dorsal dual-pathway cognitive model, MindFlow introduces a novel framework for streaming facial animation in dyadic conversations, under which digital avatars simultaneously perceive conversational emotions while reflexively synchronizing with acoustic rhythms, naturally yielding interactions that are both semantically rich and physically fluid.

**Figure 2.** Inspired by the Ventral-Dorsal dual-pathway model, MindFlow generates lifelike conversational facial animations (both listening and speaking) driven by continuous dialogue audio $A_a$ and $A_b$. 1) The Ventral module functions as a cognitive semantic perceiver, leveraging an MLLM with a streaming Chain-of-State to decode evolving emotion states chunk-by-chunk, conditioned on the historical cognitive trajectory.…

**Figure 3.** Visualization of Ventral module reasoning and corresponding generated result. The MetaHuman plugin in Unreal Engine is utilized for rendering.

Formally, at step $k$, the Ventral module predicts the current state $S_a^k$ conditioned not only on the current audio chunk $A^k$ (as mixed chunk $A_a^{wk:wk+w}$ and $A_b^{wk:wk+w}$ from both interlocutors) but also on the history of past audio inputs and previously inferred stat…
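The windowing in that passage can be sketched as follows: chunk k covers samples [wk, wk + w) of both speakers' audio, and each state is predicted from the current chunk plus everything seen so far. This is a rough illustration under stated assumptions. `toy_state_model` is a hypothetical energy rule standing in for the MLLM-based predictor, and the labels "animated", "attentive", and "neutral" are placeholders, not the paper's state vocabulary.

```python
def chunk_windows(num_samples, w):
    """Yield (k, start, end) for windows [w*k, w*k + w) covering the stream."""
    k = 0
    while w * k < num_samples:
        yield k, w * k, min(w * k + w, num_samples)
        k += 1

def toy_state_model(chunk_a, chunk_b, history):
    """Placeholder for the state predictor (assumption, not the paper's MLLM)."""
    energy_a = sum(abs(x) for x in chunk_a) / len(chunk_a)
    energy_b = sum(abs(x) for x in chunk_b) / len(chunk_b)
    previous = history[-1][1] if history else "neutral"
    if energy_a > 2 * energy_b:
        return "animated"
    if energy_b > 2 * energy_a:
        return "attentive"
    return previous  # ambiguous chunk: carry the previous state forward

def state_chain(audio_a, audio_b, w, model=toy_state_model):
    """Predict one state per chunk, conditioned on past chunks and past states."""
    history = []  # ((chunk_a, chunk_b), state) pairs for all earlier steps
    for k, start, end in chunk_windows(len(audio_a), w):
        chunk = (audio_a[start:end], audio_b[start:end])
        state = model(chunk[0], chunk[1], history)
        history.append((chunk, state))
    return [state for _, state in history]

audio_a = [1.0] * 4 + [0.0] * 4 + [0.5] * 4
audio_b = [0.0] * 4 + [1.0] * 4 + [0.5] * 4
states = state_chain(audio_a, audio_b, w=4)
```

The third chunk has equal energy on both channels, so the toy rule falls back on the previous state. That fallback is the only place history matters here, whereas a real predictor would condition on it at every step.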

**Figure 4.** Attention map showing motion query focusing on relevant audio.

Autoregressive Transformer Backbone. To effectively integrate the stylistic and acoustic representations, our backbone is constructed by alternately stacking the aforementioned modules. Specifically, the network comprises L = 6 identical blocks, where each block sequentially applies the Stylistic Temporal Modulator and the Selective Acoustic I…
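The alternating stack described in that fragment can be sketched structurally as below. Only the block count (L = 6) and the order of the two modules come from the text. The internals are assumptions made for illustration: the modulator is reduced to a scale and shift, the injector to a single sigmoid gate on the audio features, and the same parameters are reused in every block, whereas a real model would learn separate weights per block and likely compute the gate from the features themselves.

```python
import math

NUM_BLOCKS = 6  # L = 6, as stated in the excerpt

def stylistic_modulator(h, scale, shift):
    """Placeholder modulation (assumption): per-feature scale and shift."""
    return [scale * x + shift for x in h]

def selective_injector(h, audio, gate_logit):
    """Placeholder gated injection (assumption): h + sigmoid(gate_logit) * audio."""
    gate = 1.0 / (1.0 + math.exp(-gate_logit))
    return [x + gate * a for x, a in zip(h, audio)]

def backbone(h, audio, scale, shift, gate_logit):
    """Apply modulator then injector, NUM_BLOCKS times in sequence."""
    for _ in range(NUM_BLOCKS):
        h = stylistic_modulator(h, scale, shift)
        h = selective_injector(h, audio, gate_logit)
    return h

motion = [1.0, 2.0]
audio = [5.0, 5.0]
gate_closed = backbone(motion, audio, scale=1.0, shift=0.0, gate_logit=-50.0)
gate_half = backbone(motion, audio, scale=1.0, shift=0.0, gate_logit=0.0)
```

With the gate driven strongly negative the audio contributes essentially nothing and the motion features pass through unchanged, which is the behavior one would want while the avatar is listening.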

**Figure 5.** Performance under conversational tension

**Figure 6.** Compared to the Sentence-Action approach, our proposed Chunk-State approach yields reactions with more appropriate timing and style. (Preference chart not reproduced; recoverable labels: A2P, DT, CL, Ours; Naturalness, Fitness.)

**Figure 8.** Impact of chunk size.

Selective Acoustic Injector. We validate this injector by comparing our method and a retrained A2P [33] against two cross-ablated variants: (1) A2P augmented with our proposed injector, and (2) our method using the A2P acoustic module. As shown in Tab. 2b, incorporating the injector consistently improves lip-sync quality across architectures. This demonstrates its effectiveness in sel…

**Figure 9.** A failure case of head-pose drift: the drift appears after only a few inference steps. Frames are displayed at 2.5 fps for clarity.

To address this issue, we train the Dorsal module to predict the angular velocity of the head pose instead of the absolute Euler angles. This formulation not only prevents the module from exploiting such shortcuts but also aligns better with the intrinsic correlation between a…
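The velocity formulation in that passage can be illustrated with a small round trip: absolute poses are differenced into per-frame velocity targets for training, and predicted velocities are accumulated back into poses at inference. This is a simplified sketch, assuming per-axis Euler-angle rates in degrees per second and a fixed frame rate. The function names and the 25 fps value are hypothetical, and the paper's exact parameterization is not given in the excerpt.

```python
def finite_difference_velocity(poses, fps):
    """Turn absolute poses (deg) into per-frame velocity targets (deg/s)."""
    return [
        tuple((b - a) * fps for a, b in zip(p0, p1))
        for p0, p1 in zip(poses, poses[1:])
    ]

def integrate_head_pose(initial_pose, angular_velocities, fps):
    """Accumulate per-frame velocities (deg/s) into absolute poses (deg)."""
    dt = 1.0 / fps
    pose = list(initial_pose)
    trajectory = [tuple(pose)]
    for omega in angular_velocities:
        pose = [p + w * dt for p, w in zip(pose, omega)]
        trajectory.append(tuple(pose))
    return trajectory

# (pitch, yaw, roll) in degrees for three consecutive frames
poses = [(0.0, 0.0, 0.0), (1.0, 0.0, -1.0), (3.0, 0.0, -1.0)]
velocities = finite_difference_velocity(poses, fps=25)
rebuilt = integrate_head_pose(poses[0], velocities, fps=25)
```

Because each output pose is the previous pose plus a bounded increment, the model cannot satisfy the objective by copying or extrapolating absolute angles, which is one plausible reading of the "shortcut" the authors mention.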

**Figure 10.** Impact of sampling step on synthesis quality and inference latency

D Ethics Statement. MindFlow introduces a dual-pathway framework for generating streaming 3D facial animations in dyadic conversations. By harmonizing cognitive semantics with acoustic dynamics, this model enables digital avatars to achieve natural, context-aware, and expressive interactions. While this technology offers significant benefi…

Abstract

Generating lifelike facial animation for dyadic conversations requires reconciling high-level cognitive intent with precise low-level motor reflexes, yet existing methods fall short in the semantic understanding of dialogue context and in precise dynamic control. In this paper, we propose MindFlow, a dual-pathway generative framework inspired by the Ventral-Dorsal pathway model in neuroscience, which decouples generation into two collaborative streams, thereby harmonizing deep semantic reasoning with fine-grained control. In the Ventral module, we transform the conventional Sentence-Action approach into a novel Chunk-State approach that models raw acoustic streams as a context-aware, evolving emotional state chain, capturing subtle paralinguistic nuances and mid-utterance emotional shifts missed by sentence-level modeling. The Dorsal module features a conditional autoregressive flow matching network for high-fidelity facial motion, driven by high-frequency acoustic cues and modulated by emotion states, plus a Selective Acoustic Injector for adaptive audio gating to ensure robustness in talking-and-listening dynamics without interference. Extensive experiments demonstrate that MindFlow achieves superior semantic appropriateness and motion naturalness compared to state-of-the-art baselines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

MindFlow sketches a dual-pathway architecture for conversational facial animation but the abstract supplies zero numbers or implementation details, so the superiority claim cannot be checked.

The abstract lays out MindFlow as a dual-pathway model that splits semantic reasoning from motion control, borrowing the ventral-dorsal split from neuroscience. The ventral side replaces sentence-level processing with a chunk-state model that turns raw audio into an evolving emotional state chain, while the dorsal side runs conditional autoregressive flow matching plus a selective acoustic injector for gating. Those two named pieces are the clearest new elements.

The paper does a reasonable job spelling out the motivation: sentence-level models miss mid-utterance shifts and paralinguistic cues, and a clean separation of pathways might help in back-and-forth conversation. The high-level framing is coherent on its own terms.

The soft spots are exactly where the stress-test note says. No chunk boundaries, state representation, transition rules, or loss functions are given, so it is impossible to tell whether the chunk-state change actually captures the claimed nuances or just re-labels existing audio processing. The abstract asserts better semantic appropriateness and motion naturalness than baselines, yet supplies no metrics, datasets, statistical tests, or even baseline names. Without those, the central claim stays unevaluated. The neuroscience inspiration is stated but not shown to produce measurable gains over prior dual-stream animation work.

This is aimed at researchers building talking-head systems for games, virtual agents, or HCI. A reader already working on acoustic-to-face pipelines might pick up the injector idea or the chunk framing as a prompt for their own experiments. It does not look like a foundational result.

I would send it to peer review only if the full manuscript contains quantitative comparisons, ablations on the chunk mechanism, and reproducible training details; the abstract alone does not yet justify referee time.

Referee Report

2 major / 1 minor

Summary. The paper proposes MindFlow, a dual-pathway generative framework for facial animation in dyadic conversations, inspired by the Ventral-Dorsal pathway model in neuroscience. The Ventral module transforms the Sentence-Action approach into a Chunk-State model that treats raw acoustic streams as an evolving emotional state chain to capture paralinguistic nuances and mid-utterance shifts. The Dorsal module uses a conditional autoregressive flow matching network for facial motion, modulated by emotion states and using a Selective Acoustic Injector for audio gating. The paper claims that extensive experiments show MindFlow achieves superior semantic appropriateness and motion naturalness compared to state-of-the-art baselines.

Significance. If the empirical claims hold with proper validation, the work could advance conversational facial animation by better handling dynamic emotional shifts via the neuroscience-inspired separation of semantic reasoning and motion control. The Chunk-State modeling, if substantiated, addresses a plausible limitation of sentence-level approaches, but the significance hinges on the missing quantitative evidence and implementation details.

major comments (2)

[Abstract] The claim that MindFlow 'achieves superior semantic appropriateness and motion naturalness compared to state-of-the-art baselines' supplies no quantitative results, metrics, baselines, statistical tests, or dataset details, rendering the central empirical claim unevaluable from the provided text.
[Ventral module description] Ventral module (Chunk-State approach): the asserted advantage for capturing 'subtle paralinguistic nuances and mid-utterance emotional shifts missed by sentence-level modeling' rests on an unverified modeling assumption; no definition of chunk boundaries, state representation, transition dynamics, or training objective is supplied, which is load-bearing for the claimed improvement over conventional Sentence-Action methods.

minor comments (1)

[Abstract] The abstract would be strengthened by including at least one concrete metric or dataset name to ground the performance claims.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and commit to revisions that improve clarity and completeness without altering the core contributions.

Referee: [Abstract] The claim that MindFlow 'achieves superior semantic appropriateness and motion naturalness compared to state-of-the-art baselines' supplies no quantitative results, metrics, baselines, statistical tests, or dataset details, rendering the central empirical claim unevaluable from the provided text.

Authors: We agree that the abstract, as a concise summary, does not contain the requested quantitative details. In the revised manuscript we will augment the abstract with the primary metrics (e.g., semantic appropriateness and motion naturalness scores), the key baselines, the dataset, and a brief statement on statistical significance to make the central empirical claim directly evaluable.

revision: yes

Referee: [Ventral module description] Ventral module (Chunk-State approach): the asserted advantage for capturing 'subtle paralinguistic nuances and mid-utterance emotional shifts missed by sentence-level modeling' rests on an unverified modeling assumption; no definition of chunk boundaries, state representation, transition dynamics, or training objective is supplied, which is load-bearing for the claimed improvement over conventional Sentence-Action methods.

Authors: The current manuscript description of the Chunk-State model is indeed high-level and lacks the explicit implementation details noted. We will revise the Ventral module section to supply precise definitions: chunk boundaries as overlapping acoustic segments delimited by energy-based silence detection, state representation as a sequence of 128-dimensional emotional embedding vectors, transition dynamics as a gated recurrent update, and the training objective as a joint flow-matching plus emotion-classification loss. These additions will substantiate the claimed advantages over sentence-level modeling.

revision: yes
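For readers unfamiliar with the term, energy-based silence detection as named in the simulated response can be sketched as below: compute short-time energy per frame and keep runs of frames above a threshold. This illustrates the generic technique only. The rebuttal is simulated, so nothing here is confirmed as the authors' procedure; the frame length and threshold are arbitrary assumptions, and the overlap between segments mentioned in the response is not modeled.

```python
def frame_energies(samples, frame_len):
    """Mean squared amplitude of consecutive non-overlapping frames."""
    return [
        sum(x * x for x in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]

def voiced_segments(samples, frame_len, threshold):
    """Return (start, end) sample ranges whose frame energy exceeds threshold."""
    segments = []
    start = None
    energies = frame_energies(samples, frame_len)
    for idx, energy in enumerate(energies):
        if energy > threshold and start is None:
            start = idx * frame_len            # a voiced run begins
        elif energy <= threshold and start is not None:
            segments.append((start, idx * frame_len))  # silence closes the run
            start = None
    if start is not None:                      # stream ended while still voiced
        segments.append((start, len(energies) * frame_len))
    return segments

samples = [0.0] * 4 + [1.0] * 8 + [0.0] * 4 + [0.5] * 4
segments = voiced_segments(samples, frame_len=4, threshold=0.1)
```

On this toy signal the detector returns two segments separated by the silent stretch in the middle. Real audio would need a noise-adaptive threshold and some minimum silence duration to avoid splitting on brief pauses.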

Circularity Check

0 steps flagged

No circularity: framework presented as independent neuroscience-inspired construction

The abstract and provided text describe MindFlow as a dual-pathway generative framework with Ventral (Chunk-State acoustic modeling) and Dorsal (autoregressive flow matching) modules. No equations, derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear. The Chunk-State transformation is introduced as a novel modeling choice without reducing to prior self-referential results or definitions. Performance superiority is asserted via experiments rather than by construction. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no explicit free parameters, mathematical axioms, or new postulated entities are described in sufficient detail to enumerate.

Reference graph

Works this paper leans on

54 extracted references · 14 canonical work pages · 4 internal anchors

[1] Baevski, A., Zhou, Y., Mohamed, A., Auli, M.: wav2vec 2.0: A framework for self-supervised learning of speech representations. In: NeurIPS. pp. 12449–12460 (2020)
[2] Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. In: NeurIPS. vol. 33, pp. 1877–1901 (2020)
[3] Cen, Z., Pi, H., Peng, S., Shuai, Q., Shen, Y., Bao, H., Zhou, X., Hu, R.: Ready-to-React: Online reaction policy for two-character interaction generation. In: ICLR (2025)
[4] Chatziagapi, A., Morency, L.P., Gong, H., Zollhöfer, M., Samaras, D., Richard, A.: Av-flow: Transforming text to audio-visual human-like interactions. In: ICCV. pp. 14270–14282 (2025)
[5] Chen, B., Liu, H.: Dystream: Streaming dyadic talking heads generation via flow matching-based autoregressive model. arXiv preprint arXiv:2512.24408 (2025)
[6] Chen, B., Martí Monsó, D., Du, Y., Simchowitz, M., Tedrake, R., Sitzmann, V.: Diffusion forcing: Next-token prediction meets full-sequence diffusion (2024)
[7] Chen, H., Zhang, H., Zhang, S., Liu, X., Zhuang, S., Wan, P., Zhang, D., Li, S., et al.: Cafe-talk: Generating 3d talking face animation with multimodal coarse-and fine-grained control. In: ICLR (2025)
[8] Chen, M., Cui, L., Zhang, W., Zhang, H., Zhou, Y., Li, X., Tang, S., Liu, J., Liao, B., Chen, H., et al.: Midas: Multimodal interactive digital-human synthesis via real-time autoregressive video generation. arXiv preprint arXiv:2508.19320 (2025)
[9] Chen, Q., Cheng, L., Deng, C., Li, X., Liu, J., Tan, C.H., Wang, W., Xu, J., Ye, J., Zhang, Q., Zhang, Q., Zhou, J.: Fun-audio-chat technical report. arXiv preprint arXiv:2512.20156 (2025)
[10] Chung, J.S., Zisserman, A.: Out of time: automated lip sync in the wild. In: ACCV 2016 Workshops. pp. 251–263. Springer (2017)
[11] Fan, X., Li, J., Lin, Z., Xiao, W., Yang, L.: Unitalker: Scaling up audio-driven 3d facial animation through a unified model. In: ECCV. pp. 204–221 (2024)
[12] Fan, Y., Lin, Z., Saito, J., Wang, W., Komura, T.: FaceFormer: Speech-driven 3d facial animation with transformers. In: CVPR. pp. 18770–18780 (2022)
[13] Feng, Y., Feng, H., Black, M.J., Bolkart, T.: Learning an animatable detailed 3d face model from in-the-wild images. ACM ToG 40(4) (2021)
[14] Gao, Y., Dai, Y., Zhang, G., Guo, H., Hao, A., Li, S.: Effects of interaction modalities and emotional states on user’s perceived empathy with an llm-based embodied conversational agent. International Journal of Human-Computer Studies 204, 103585 (2025)
[15] Guo, C., Zuo, X., Wang, S., Zou, S., Sun, Q., Deng, A., Gong, M., Cheng, L.: Action2motion: Conditioned generation of 3d human motions. In: ACM MM. pp. 2021–2029 (2020)
[16] Guo, Y., Liu, X., Zhen, C., Yan, P., Wei, X.: ARIG: Autoregressive interactive head generation for real-time conversations. In: ICCV. pp. 12956–12965 (2025)
[17] He, X., Huang, Q., Zhang, Z., Lin, Z., Wu, Z., Yang, S., Li, M., Chen, Z., Xu, S., Wu, X.: Co-speech gesture video generation via motion-decoupled diffusion model. In: CVPR. pp. 2263–2273 (2024)
[18] He, X., Zhang, H., Chen, H., Zheng, C., Chen, L., Tang, S., Huang, J., Liu, X., Wan, P., Wu, Z.: From inpainting to editing: A self-bootstrapping framework for context-rich visual dubbing. arXiv preprint arXiv:2512.25066 (2025)
[19] Hickok, G., Poeppel, D.: The cortical organization of speech processing. Nature reviews neuroscience 8(5), 393–402 (2007)
[20] Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. In: NeurIPS. pp. 6840–6851 (2020)
[21] Hu, Y., Liu, R., Ren, Y., Yin, X., Li, H.: Unitalker: Conversational speech-visual synthesis. In: ACM MM. pp. 10248–10257 (2025)
[22] Ji, X., Lin, C., Ding, Z., Tai, Y., Zhu, J., Hu, X., Luo, D., Ge, Y., Wang, C.: Realtalk: Real-time and realistic audio-driven face generation with 3d facial prior-guided identity alignment network. arXiv preprint arXiv:2406.18284 (2024)
[23] Ki, T., Jang, S., Jo, J., Yoon, J., Hwang, S.J.: Avatar forcing: Real-time interactive head avatar generation for natural conversation. arXiv preprint arXiv:2601.00664 (2026)
[24] Kyrlitsias, C., Michael-Grigoriou, D.: Social interaction with agents and avatars in immersive virtual environments: A survey. Frontiers in Virtual Reality 2, 786665 (2022)
[25] Lai, P., Zhong, W., Qin, Y., Ren, X., Wang, B., Li, G.: LLM-driven multimodal and multi-identity listening head generation. In: CVPR. pp. 10656–10666 (2025)
[26] Li, C., Zhang, C., Xu, W., Lin, J., Xie, J., Feng, W., Peng, B., Chen, C., Xing, W.: LatentSync: Taming audio-conditioned latent diffusion models for lip sync with syncnet supervision. arXiv preprint arXiv:2412.09262 (2024)
[27] Lipman, Y., Chen, R.T., Ben-Hamu, H., Nickel, M., Le, M.: Flow matching for generative modeling. arXiv preprint arXiv:2210.02747 (2022)
[28] Liu, M., Wang, J., Qian, X., Li, H.: Listenformer: Responsive listening head generation with non-autoregressive transformers. In: ACM MM. pp. 7094–7103 (2024)
[29] Liu, X., Guo, Y., Zhen, C., Li, T., Ao, Y., Yan, P.: Customlistener: Text-guided responsive interaction for user-friendly listening head generation. In: CVPR. pp. 2415–2424 (2024)
[30] Lugaresi, C., Tang, J., Nash, H., McClanahan, C., Uboweja, E., Hays, M., Zhang, F., Chang, C.L., Yong, M.G., Lee, J., et al.: Mediapipe: A framework for building perception pipelines. arXiv preprint arXiv:1906.08172 (2019)
[31] Luo, C., Wang, J., Li, B., Song, S., Ghanem, B.: Omniresponse: Online multimodal conversational response generation in dyadic interactions. In: NeurIPS (2025)
[32] Ng, E., Joo, H., Hu, L., Li, H., Darrell, T., Kanazawa, A., Ginosar, S.: Learning to listen: Modeling non-deterministic dyadic facial motion. In: CVPR. pp. 20395–20405 (2022)
[33] Ng, E., Romero, J., Bagautdinov, T., Bai, S., Darrell, T., Kanazawa, A., Richard, A.: From audio to photoreal embodiment: Synthesizing humans in conversations. In: CVPR. pp. 1001–1010 (2024)
[34] Park, S., Kim, C., Rha, H., Kim, M., Hong, J., Yeo, J., Ro, Y.: Let’s go real talk: Spoken dialogue model for face-to-face conversation. In: ACL (2024)
[35] Peng, Z., Fan, Y., Wu, H., Wang, X., Liu, H., He, J., Fan, Z.: Dualtalk: Dual-speaker interaction for 3d talking head conversations. In: CVPR. pp. 21055–21064 (2025)
[36] Peng, Z., Wu, H., Song, Z., Xu, H., Zhu, X., He, J., Liu, H., Fan, Z.: Emotalk: Speech-driven emotional disentanglement for 3d face animation. In: ICCV. pp. 20687–20697 (2023)
[37] Pratticó, F.G., Shabkhoslati, J.A., Shaghaghi, N., Lamberti, F.: Bot undercover: on the use of conversational agents to stimulate teacher-students interaction in remote learning. In: 2022 IEEE Conference on Virtual Reality and 3D User Interfaces Abstracts and Workshops (VRW). pp. 277–282. IEEE (2022)
[38] Siniukov, M., Chang, D., Tran, M., Gong, H., Chaubey, A., Soleymani, M.: DiTaiListener: Controllable high fidelity listener video generation with diffusion. In: ICCV. pp. 11991–12001 (2025)
[39] Sun, Z., Peng, Z., Ma, Y., Chen, Y., Zhou, Z., Zhou, Z., Zhang, G., Zhang, Y., Zhou, Y., Lu, Q., et al.: Streamavatar: Streaming diffusion models for real-time interactive human avatars. arXiv preprint arXiv:2512.22065 (2025)
[40] Wang, K., Wu, Q., Song, L., Yang, Z., Wu, W., Qian, C., He, R., Qiao, Y., Loy, C.C.: MEAD: A large-scale audio-visual dataset for emotional talking-face generation. In: ECCV. pp. 700–717. Springer (2020)
[41] Wang, R., Ma, C., Li, G., Xu, H., Li, Y., Wang, Z.: You think, you act: The new task of arbitrary text to motion generation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 12012–12022 (2025)
[42] Xie, Y., Gu, T., Li, Z., Zhang, C., Song, G., Zhao, X., Liang, C., Jiang, J., Xu, H., Luo, L.: X-streamer: Unified human world modeling with audiovisual interaction. arXiv preprint arXiv:2509.21574 (2025)
[43] Xu, J., Guo, Z., Hu, H., Chu, Y., et al.: Qwen3-omni technical report. arXiv preprint arXiv:2509.17765 (2025)
[44] Xue, H., Fan, Y., Wang, X., Wu, Z.: Echo: Enhancing conversational behavior generation via hierarchical semantic comprehension with large language models. In: SIGGRAPH Asia. pp. 1–9 (2025)
[45] Yang, T.Y., Chen, Y.T., Lin, Y.Y., Chuang, Y.Y.: Fsa-net: Learning fine-grained structure aggregation for head pose estimation from a single image. In: CVPR. pp. 1087–1096 (2019)
[46] Yang, Y., Cen, Z., Peng, S., Chen, X., Deng, Y., Zhu, X., Jia, F., Zhou, X., Bao, H.: StreamingTalker: Audio-driven 3d facial animation with autoregressive diffusion model. In: AAAI (2025)
[47] Zhang, Y., Zhong, Z., Liu, M., Chen, Z., Wu, B., Zeng, Y., Zhan, C., He, Y., Huang, J., Zhou, W.: Musetalk: Real-time high-fidelity video dubbing via spatio-temporal sampling. arXiv preprint arXiv:2410.10122 (2024)
[48] Zhang, Z., Zhou, Y., Yao, H., Ao, T., Zhan, X., Liu, L.: Social agent: Mastering dyadic nonverbal behavior generation via conversational llm agents. In: ACM SIGGRAPH Asia. pp. 1–12 (2025)
[49] Zhang, Z., Li, L., Ding, Y., Fan, C.: Flow-guided one-shot talking face generation with a high-resolution audio-visual dataset. In: CVPR. pp. 3661–3670 (2021)
[50] Zhao, X., Dai, J., Zhou, F., Wang, H., Song, Z., Hao, A., Qin, H., Gao, Y.: Emoposeface: Head pose aware speech-driven 3d emotional facial animation using latent diffusion. IEEE TVCG (2026)
[51] Zhong, Z., Ji, Y., Kong, Z., Liu, Y., Wang, J., Feng, J., Liu, L., Wang, X., Li, Y., She, Y., et al.: Anytalker: Scaling multi-person talking video generation with interactivity refinement. arXiv preprint arXiv:2511.23475 (2025)
[52] Zhou, M., Bai, Y., Zhang, W., Yao, T., Zhao, T.: Interactive conversational head generation. TPAMI (2025)
[53] Zhou, M., Bai, Y., Zhang, W., Yao, T., Zhao, T., Mei, T.: Responsive listening head generation: a benchmark dataset and baseline. In: ECCV. Springer (2022)
[54] Zhu, Y., Zhang, L., Rong, Z., Hu, T., Liang, S., Ge, Z.: INFP: Audio-driven interactive head generation in dyadic conversations. In: CVPR. pp. 10667–10677 (2025)

[1] [1]

In: NeurIPS

Baevski, A., Zhou, Y., Mohamed, A., Auli, M.: wav2vec 2.0: A framework for self-supervised learning of speech representations. In: NeurIPS. pp. 12449–12460 (2020)

2020

[2] [2]

In: NeurIPS

Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Nee- lakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. In: NeurIPS. vol. 33, pp. 1877–1901 (2020)

1901

[3] [3]

In: ICLR (2025)

Cen, Z., Pi, H., Peng, S., Shuai, Q., Shen, Y., Bao, H., Zhou, X., Hu, R.: Ready-to- React: Online reaction policy for two-character interaction generation. In: ICLR (2025)

2025

[4] [4]

In: ICCV

Chatziagapi, A., Morency, L.P., Gong, H., Zollhöfer, M., Samaras, D., Richard, A.: Av-flow: Transforming text to audio-visual human-like interactions. In: ICCV. pp. 14270–14282 (2025) 20 H. Chen et al

2025

[5] [5]

arXiv preprint arXiv:2512.24408 (2025)

Chen, B., Liu, H.: Dystream: Streaming dyadic talking heads generation via flow matching-based autoregressive model. arXiv preprint arXiv:2512.24408 (2025)

work page arXiv 2025

[6] [6]

Chen, B., Martí Monsó, D., Du, Y., Simchowitz, M., Tedrake, R., Sitzmann, V.: Diffusion forcing: Next-token prediction meets full-sequence diffusion (2024)

2024

[7] [7]

In: ICLR (2025)

Chen, H., Zhang, H., Zhang, S., Liu, X., Zhuang, S., Wan, P., ZHANG, D., Li, S., et al.: Cafe-talk: Generating 3d talking face animation with multimodal coarse-and fine-grained control. In: ICLR (2025)

2025

[8] [8]

Midas: Multimodal interactive digital-human synthesis via real-time autoregressive video generation.arXiv preprint arXiv:2508.19320, 2025

Chen, M., Cui, L., Zhang, W., Zhang, H., Zhou, Y., Li, X., Tang, S., Liu, J., Liao, B., Chen, H., et al.: Midas: Multimodal interactive digital-human synthesis via real-time autoregressive video generation. arXiv preprint arXiv:2508.19320 (2025)

work page arXiv 2025

[9] [9]

arXiv preprint arXiv:2512.20156 (2025)

Chen, Q., Cheng, L., Deng, C., Li, X., Liu, J., Tan, C.H., Wang, W., Xu, J., Ye, J., Zhang, Q., Zhang, Q., Zhou, J.: Fun-audio-chat technical report. arXiv preprint arXiv:2512.20156 (2025)

work page arXiv 2025

[10] [10]

In: ACCV 2016 Workshops

Chung, J.S., Zisserman, A.: Out of time: automated lip sync in the wild. In: ACCV 2016 Workshops. pp. 251–263. Springer (2017)

2016

[11] [11]

In: ECCV

Fan, X., Li, J., Lin, Z., Xiao, W., Yang, L.: Unitalker: Scaling up audio-driven 3d facial animation through a unified model. In: ECCV. pp. 204–221 (2024)

2024

[12] [12]

In: CVPR

Fan, Y., Lin, Z., Saito, J., Wang, W., Komura, T.: FaceFormer: Speech-driven 3d facial animation with transformers. In: CVPR. pp. 18770–18780 (2022)

2022

[13] [13]

ACM ToG40(4) (2021)

Feng, Y., Feng, H., Black, M.J., Bolkart, T.: Learning an animatable detailed 3d face model from in-the-wild images. ACM ToG40(4) (2021)

[14]

Gao, Y., Dai, Y., Zhang, G., Guo, H., Hao, A., Li, S.: Effects of interaction modalities and emotional states on user’s perceived empathy with an llm-based embodied conversational agent. International Journal of Human-Computer Studies 204, 103585 (2025)

[15]

Guo, C., Zuo, X., Wang, S., Zou, S., Sun, Q., Deng, A., Gong, M., Cheng, L.: Action2motion: Conditioned generation of 3d human motions. In: ACM MM. pp. 2021–2029 (2020)

[16]

Guo, Y., Liu, X., Zhen, C., Yan, P., Wei, X.: ARIG: Autoregressive interactive head generation for real-time conversations. In: ICCV. pp. 12956–12965 (2025)

[17]

He, X., Huang, Q., Zhang, Z., Lin, Z., Wu, Z., Yang, S., Li, M., Chen, Z., Xu, S., Wu, X.: Co-speech gesture video generation via motion-decoupled diffusion model. In: CVPR. pp. 2263–2273 (2024)

[18]

He, X., Zhang, H., Chen, H., Zheng, C., Chen, L., Tang, S., Huang, J., Liu, X., Wan, P., Wu, Z.: From inpainting to editing: A self-bootstrapping framework for context-rich visual dubbing. arXiv preprint arXiv:2512.25066 (2025)

[19]

Hickok, G., Poeppel, D.: The cortical organization of speech processing. Nature Reviews Neuroscience 8(5), 393–402 (2007)

[20]

Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. In: NeurIPS. pp. 6840–6851 (2020)

[21]

Hu, Y., Liu, R., Ren, Y., Yin, X., Li, H.: Unitalker: Conversational speech-visual synthesis. In: ACM MM. pp. 10248–10257 (2025)

[22]

Ji, X., Lin, C., Ding, Z., Tai, Y., Zhu, J., Hu, X., Luo, D., Ge, Y., Wang, C.: Realtalk: Real-time and realistic audio-driven face generation with 3d facial prior-guided identity alignment network. arXiv preprint arXiv:2406.18284 (2024)

[23]

Ki, T., Jang, S., Jo, J., Yoon, J., Hwang, S.J.: Avatar forcing: Real-time interactive head avatar generation for natural conversation. arXiv preprint arXiv:2601.00664 (2026)

[24]

Kyrlitsias, C., Michael-Grigoriou, D.: Social interaction with agents and avatars in immersive virtual environments: A survey. Frontiers in Virtual Reality 2, 786665 (2022)

[25]

Lai, P., Zhong, W., Qin, Y., Ren, X., Wang, B., Li, G.: LLM-driven multimodal and multi-identity listening head generation. In: CVPR. pp. 10656–10666 (2025)

[26]

Li, C., Zhang, C., Xu, W., Lin, J., Xie, J., Feng, W., Peng, B., Chen, C., Xing, W.: LatentSync: Taming audio-conditioned latent diffusion models for lip sync with syncnet supervision. arXiv preprint arXiv:2412.09262 (2024)

[27]

Lipman, Y., Chen, R.T., Ben-Hamu, H., Nickel, M., Le, M.: Flow matching for generative modeling. arXiv preprint arXiv:2210.02747 (2022)

[28]

Liu, M., Wang, J., Qian, X., Li, H.: Listenformer: Responsive listening head generation with non-autoregressive transformers. In: ACM MM. pp. 7094–7103 (2024)

[29]

Liu, X., Guo, Y., Zhen, C., Li, T., Ao, Y., Yan, P.: Customlistener: Text-guided responsive interaction for user-friendly listening head generation. In: CVPR. pp. 2415–2424 (2024)

[30]

Lugaresi, C., Tang, J., Nash, H., McClanahan, C., Uboweja, E., Hays, M., Zhang, F., Chang, C.L., Yong, M.G., Lee, J., et al.: Mediapipe: A framework for building perception pipelines. arXiv preprint arXiv:1906.08172 (2019)

[31]

Luo, C., Wang, J., Li, B., Song, S., Ghanem, B.: Omniresponse: Online multimodal conversational response generation in dyadic interactions. In: NeurIPS (2025)

[32]

Ng, E., Joo, H., Hu, L., Li, H., Darrell, T., Kanazawa, A., Ginosar, S.: Learning to listen: Modeling non-deterministic dyadic facial motion. In: CVPR. pp. 20395–20405 (2022)

[33]

Ng, E., Romero, J., Bagautdinov, T., Bai, S., Darrell, T., Kanazawa, A., Richard, A.: From audio to photoreal embodiment: Synthesizing humans in conversations. In: CVPR. pp. 1001–1010 (2024)

[34]

Park, S., Kim, C., Rha, H., Kim, M., Hong, J., Yeo, J., Ro, Y.: Let’s go real talk: Spoken dialogue model for face-to-face conversation. In: ACL (2024)

[35]

Peng, Z., Fan, Y., Wu, H., Wang, X., Liu, H., He, J., Fan, Z.: Dualtalk: Dual-speaker interaction for 3d talking head conversations. In: CVPR. pp. 21055–21064 (2025)

[36]

Peng, Z., Wu, H., Song, Z., Xu, H., Zhu, X., He, J., Liu, H., Fan, Z.: Emotalk: Speech-driven emotional disentanglement for 3d face animation. In: ICCV. pp. 20687–20697 (2023)

[37]

Pratticó, F.G., Shabkhoslati, J.A., Shaghaghi, N., Lamberti, F.: Bot undercover: on the use of conversational agents to stimulate teacher-students interaction in remote learning. In: 2022 IEEE Conference on Virtual Reality and 3D User Interfaces Abstracts and Workshops (VRW). pp. 277–282. IEEE (2022)

[38]

Siniukov, M., Chang, D., Tran, M., Gong, H., Chaubey, A., Soleymani, M.: DiTaiListener: Controllable high fidelity listener video generation with diffusion. In: ICCV. pp. 11991–12001 (2025)

[39]

Sun, Z., Peng, Z., Ma, Y., Chen, Y., Zhou, Z., Zhou, Z., Zhang, G., Zhang, Y., Zhou, Y., Lu, Q., et al.: Streamavatar: Streaming diffusion models for real-time interactive human avatars. arXiv preprint arXiv:2512.22065 (2025)

[40]

Wang, K., Wu, Q., Song, L., Yang, Z., Wu, W., Qian, C., He, R., Qiao, Y., Loy, C.C.: MEAD: A large-scale audio-visual dataset for emotional talking-face generation. In: ECCV. pp. 700–717. Springer (2020)

[41]

Wang, R., Ma, C., Li, G., Xu, H., Li, Y., Wang, Z.: You think, you act: The new task of arbitrary text to motion generation. In: ICCV. pp. 12012–12022 (2025)

[42]

Xie, Y., Gu, T., Li, Z., Zhang, C., Song, G., Zhao, X., Liang, C., Jiang, J., Xu, H., Luo, L.: X-streamer: Unified human world modeling with audiovisual interaction. arXiv preprint arXiv:2509.21574 (2025)

[43]

Xu, J., Guo, Z., Hu, H., Chu, Y., et al.: Qwen3-omni technical report. arXiv preprint arXiv:2509.17765 (2025)

[44]

Xue, H., Fan, Y., Wang, X., Wu, Z.: Echo: Enhancing conversational behavior generation via hierarchical semantic comprehension with large language models. In: SIGGRAPH Asia. pp. 1–9 (2025)

[45]

Yang, T.Y., Chen, Y.T., Lin, Y.Y., Chuang, Y.Y.: Fsa-net: Learning fine-grained structure aggregation for head pose estimation from a single image. In: CVPR. pp. 1087–1096 (2019)

[46]

Yang, Y., Cen, Z., Peng, S., Chen, X., Deng, Y., Zhu, X., Jia, F., Zhou, X., Bao, H.: StreamingTalker: Audio-driven 3d facial animation with autoregressive diffusion model. In: AAAI (2025)

[47]

Zhang, Y., Zhong, Z., Liu, M., Chen, Z., Wu, B., Zeng, Y., Zhan, C., He, Y., Huang, J., Zhou, W.: Musetalk: Real-time high-fidelity video dubbing via spatio-temporal sampling. arXiv preprint arXiv:2410.10122 (2024)

[48]

Zhang, Z., Zhou, Y., Yao, H., Ao, T., Zhan, X., Liu, L.: Social agent: Mastering dyadic nonverbal behavior generation via conversational llm agents. In: ACM SIGGRAPH Asia. pp. 1–12 (2025)

[49]

Zhang, Z., Li, L., Ding, Y., Fan, C.: Flow-guided one-shot talking face generation with a high-resolution audio-visual dataset. In: CVPR. pp. 3661–3670 (2021)

[50]

Zhao, X., Dai, J., Zhou, F., Wang, H., Song, Z., Hao, A., Qin, H., Gao, Y.: Emoposeface: Head pose aware speech-driven 3d emotional facial animation using latent diffusion. IEEE TVCG (2026)

[51]

Zhong, Z., Ji, Y., Kong, Z., Liu, Y., Wang, J., Feng, J., Liu, L., Wang, X., Li, Y., She, Y., et al.: Anytalker: Scaling multi-person talking video generation with interactivity refinement. arXiv preprint arXiv:2511.23475 (2025)

[52]

Zhou, M., Bai, Y., Zhang, W., Yao, T., Zhao, T.: Interactive conversational head generation. TPAMI (2025)

[53]

Zhou, M., Bai, Y., Zhang, W., Yao, T., Zhao, T., Mei, T.: Responsive listening head generation: a benchmark dataset and baseline. In: ECCV. Springer (2022)

[54]

Zhu, Y., Zhang, L., Rong, Z., Hu, T., Liang, S., Ge, Z.: INFP: Audio-driven interactive head generation in dyadic conversations. In: CVPR. pp. 10667–10677 (2025)
