PersonaGesture: Single-Reference Co-Speech Gesture Personalization for Unseen Speakers

2026 · cs.CV · arXiv 2605.06064

2 Pith papers cite this work. Polarity classification is still indexing.

abstract

We propose PersonaGesture, a diffusion-based pipeline for single-reference co-speech gesture personalization of unseen speakers. Given target speech and one motion clip from a new speaker, the model must synthesize gestures that follow the new utterance while retaining speaker-specific pose choices, without per-speaker optimization. This setting is useful for avatars and virtual agents, but it is hard because the reference mixes stable speaker habits with utterance-specific trajectories. PersonaGesture consists of two key components, Adaptive Style Infusion (ASI) and Implicit Distribution Rectification (IDR), to separate temporal identity evidence from residual statistic correction. A Style Perceiver first encodes the variable-length reference into compact speaker-memory tokens. ASI injects these tokens into denoising through zero-initialized residual cross-attention, enabling style evidence to affect motion formation without replacing the pretrained speech-to-motion prior. Building on this, IDR applies a length-aware diagonal affine map in latent space to correct residual channel-wise moments estimated from the same reference. Across BEAT2 and ZeroEGGS, we evaluate quantitative metrics, reference-identity controls, same-audio diagnostics, qualitative comparisons, and human preference. Experiments show that separating denoising-time speaker memory from conservative post-generation moment correction improves unseen-speaker personalization over collapsed style codes, full-reference attention, and one-clip finetuning. Project: https://xiangyue-zhang.github.io/PersonaGesture.
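The two mechanisms named in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the class and function names, the single-head attention, the `tau` length-confidence term, and the exact blending formula are all assumptions made for illustration. The sketch only shows (a) a residual cross-attention branch whose output projection starts at zero, so the pretrained path is untouched at initialization, and (b) a per-channel (diagonal) affine map that moves generated latent moments toward reference moments, with a strength that grows with reference length.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


class ZeroInitCrossAttention:
    """Hypothetical single-head residual cross-attention over speaker-memory tokens."""

    def __init__(self, dim, rng):
        s = 1.0 / np.sqrt(dim)
        self.w_q = rng.normal(0.0, s, (dim, dim))
        self.w_k = rng.normal(0.0, s, (dim, dim))
        self.w_v = rng.normal(0.0, s, (dim, dim))
        # Zero-initialized output projection: the branch contributes nothing
        # until training moves these weights away from zero.
        self.w_out = np.zeros((dim, dim))

    def __call__(self, hidden, memory):
        # hidden: (T, dim) denoiser features; memory: (M, dim) speaker tokens.
        q = hidden @ self.w_q
        k = memory @ self.w_k
        v = memory @ self.w_v
        attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
        return hidden + (attn @ v) @ self.w_out  # residual connection


def moment_correction(latents, reference, tau=32.0, eps=1e-6):
    """Diagonal affine map toward reference channel-wise mean and std.

    latents: (T, C) generated latents; reference: (T_ref, C) reference latents.
    tau is an assumed hyperparameter: short references get a weaker correction.
    """
    mu_g, sd_g = latents.mean(axis=0), latents.std(axis=0)
    mu_r, sd_r = reference.mean(axis=0), reference.std(axis=0)
    # Assumed length-aware confidence in [0, 1).
    lam = reference.shape[0] / (reference.shape[0] + tau)
    scale = (1.0 - lam) + lam * sd_r / (sd_g + eps)  # per-channel scale
    shift = lam * (mu_r - mu_g)                      # per-channel shift
    return mu_g + scale * (latents - mu_g) + shift


rng = np.random.default_rng(0)
layer = ZeroInitCrossAttention(dim=8, rng=rng)
hidden = rng.normal(size=(10, 8))
memory = rng.normal(size=(4, 8))
styled = layer(hidden, memory)

generated = rng.normal(size=(50, 8))
reference = 2.0 + 3.0 * rng.normal(size=(200, 8))
corrected = moment_correction(generated, reference)
```

At initialization `styled` equals `hidden`, which is the sense in which a zero-initialized branch leaves a pretrained prior intact. In `moment_correction`, setting `tau=0` gives a full moment match, while a very large `tau` leaves the latents essentially unchanged.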

representative citing papers

OmniDance: Multimodal Driven Dance Video Generation with Large-scale Internet Data

cs.CV · 2026-06-29 · unverdicted · novelty 7.0

Introduces CIPE-Dance, presented as the largest dance video dataset, and the OmniDance framework for unified text- and music-driven dance video generation, reporting SOTA results on the TI2V, MI2V, and MTI2V tasks.

Semantic-Aware Motion Encoding for Topology-Agnostic Character Animation

cs.GR · 2026-05-26 · unverdicted · novelty 5.0

A semantic modulation mechanism decouples motion from topology to create a continuous generative motion space from unaligned BVH data, supporting zero-shot cross-species retargeting.
