Seamless interaction: Dyadic audiovisual motion modeling and large-scale dataset
5 Pith papers cite this work. Polarity classification is still indexing.
years: 2026 (5)
verdicts: UNVERDICTED (5)
citing papers explorer
- Beyond Monologue: Interactive Talking-Listening Avatar Generation with Conversational Audio Context-Aware Kernels
  Multi-head Gaussian kernels inject temporal scale discrepancy as inductive bias to enable full-duplex talking-listening avatar generation, supported by a new decoupled VoxHear dataset and claimed SOTA naturalness.
- SentiAvatar: Towards Expressive and Interactive Digital Humans
  SentiAvatar generates expressive interactive 3D avatars in real time by combining a 37-hour mocap dialogue dataset with a pre-trained motion foundation model and an audio-aware plan-then-infill architecture that separates semantic planning from prosody-driven frame interpolation.
- Mutual Forcing: Dual-Mode Self-Evolution for Fast Autoregressive Audio-Video Character Generation
  Mutual Forcing trains a single native autoregressive audio-video model with mutually reinforcing few-step and multi-step modes via self-distillation to match 50-step baselines at 4-8 steps.
- LPM 1.0: Video-based Character Performance Model
  LPM 1.0 generates infinite-length, identity-stable, real-time audio-visual conversational performances for single characters using a distilled causal diffusion transformer and a new benchmark.
- Reality Check: How Avatar and Face Representation Affect the Perceptual Evaluation of Synthesized Gestures
  Avatar appearance and facial presentation systematically bias perceptual judgments of synthesized co-speech gestures.