Integrating Facial Generation into Full-Duplex Spoken Dialogue Systems

2026 · cs.HC · arXiv 2606.21970

abstract

Full-duplex spoken dialogue models, such as Moshi, enable natural, low-latency voice conversations. However, they remain limited to the audio modality, lacking the facial expressions that are integral to human communication. We present Moshi-Face, the first full-duplex dialogue model that jointly processes the user's audio and facial input while simultaneously generating speech and facial motion. We first construct a vector-quantized variational autoencoder (VQ-VAE) as a face codec that encodes 3D head meshes extracted from facial videos into compact discrete tokens, referred to as face tokens, and conversely reconstructs 3D meshes from these tokens. We then extend Moshi with a Face Transformer module that generates face tokens non-autoregressively, enabling Moshi-Face to produce synchronized audio and face tokens in real time. Experiments show that Moshi-Face achieves audiovisual alignment at low latency while preserving the dialogue quality of the original audio-only model.
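
The face codec described above turns continuous mesh features into discrete face tokens. As a rough illustration of the quantization step at the heart of any VQ-VAE, the sketch below maps each latent vector to the index of its nearest codebook entry and back. It is not the paper's implementation: the 2-D latents, the 3-entry codebook, and the function names `quantize` and `dequantize` are all hypothetical, and a real codec would learn the codebook and wrap this step between a trained encoder and decoder.

```python
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(latents: List[Vector], codebook: List[Vector]) -> List[int]:
    """Map each continuous latent to the index of its nearest codebook entry."""
    return [
        min(range(len(codebook)), key=lambda k: squared_distance(z, codebook[k]))
        for z in latents
    ]


def dequantize(tokens: List[int], codebook: List[Vector]) -> List[Vector]:
    """Look up the codebook vector for each token (the decoder's input)."""
    return [codebook[t] for t in tokens]


# Toy data (hypothetical): three 2-D latents and a 3-entry codebook.
codebook = [(0.0, 0.0), (1.0, 1.0), (-1.0, 0.5)]
latents = [(0.1, -0.2), (0.9, 1.2), (-0.8, 0.4)]

tokens = quantize(latents, codebook)   # discrete "face tokens"
recon = dequantize(tokens, codebook)   # quantized latents
print(tokens)
```

Because each token is only an integer index, a sequence of such tokens can be predicted by a Transformer in the same way as audio codec tokens, which is the property the abstract relies on.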

representative citing papers

Integrating Facial Generation into Full-Duplex Spoken Dialogue Systems

cs.HC · 2026-06-20 · unverdicted · novelty 6.0

Moshi-Face is the first full-duplex spoken dialogue model that jointly processes audio and facial inputs while generating synchronized speech and facial motion tokens at low latency.
