PersonaGesture: Single-Reference Co-Speech Gesture Personalization for Unseen Speakers

Haiyang Liu; Jiaxu Zhang; Kaixing Yang; Kunhang Li; Xiangyue Zhang; Xuangeng Chu; Yiyi Cai; You Zhou; Zhengqing Li

arxiv: 2605.06064 · v1 · submitted 2026-05-07 · 💻 cs.CV

PersonaGesture: Single-Reference Co-Speech Gesture Personalization for Unseen Speakers

Xiangyue Zhang , Yiyi Cai , Kunhang Li , Kaixing Yang , You Zhou , Zhengqing Li , Xuangeng Chu , Jiaxu Zhang

show 1 more author

Haiyang Liu

This is my paper

Pith reviewed 2026-05-08 14:18 UTC · model grok-4.3

classification 💻 cs.CV

keywords personagesturestylepersonalizationreferenceresidualspeakerco-speechcorrection

0 comments

The pith

A diffusion model with adaptive style infusion via zero-init cross-attention and implicit distribution rectification via length-aware affine correction personalizes gestures for new speakers from a single reference clip.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The work targets the problem of making gestures that match spoken words while keeping the unique movement style of a person who has never been seen before. Only one short video clip of that person is given as reference. A pretrained diffusion model that turns speech into motion is kept mostly unchanged. Style information from the reference is added during the noise-removal steps using a special attention mechanism that starts at zero so it does not overwrite the original model. After generation, a simple correction adjusts the average motion statistics to better match the reference clip. Tests on two public gesture datasets compare this approach against simpler ways of copying style or fine-tuning the whole model.

Core claim

Experiments show that separating denoising-time speaker memory from conservative post-generation moment correction improves unseen-speaker personalization over collapsed style codes, full-reference attention, and one-clip finetuning.

Load-bearing premise

That a single variable-length reference motion clip supplies stable speaker-specific pose choices that can be cleanly separated from utterance-specific trajectories and that the pretrained speech-to-motion prior remains useful after style injection.

read the original abstract

We propose PersonaGesture, a diffusion-based pipeline for single-reference co-speech gesture personalization of unseen speakers. Given target speech and one motion clip from a new speaker, the model must synthesize gestures that follow the new utterance while retaining speaker-specific pose choices, without per-speaker optimization. This setting is useful for avatars and virtual agents, but it is hard because the reference mixes stable speaker habits with utterance-specific trajectories. PersonaGesture consists of two key components, Adaptive Style Infusion (ASI) and Implicit Distribution Rectification (IDR), to separate temporal identity evidence from residual statistic correction. A Style Perceiver first encodes the variable-length reference into compact speaker-memory tokens. ASI injects these tokens into denoising through zero-initialized residual cross-attention, enabling style evidence to affect motion formation without replacing the pretrained speech-to-motion prior. Building on this, IDR applies a length-aware diagonal affine map in latent space to correct residual channel-wise moments estimated from the same reference. Across BEAT2 and ZeroEGGS, we evaluate quantitative metrics, reference-identity controls, same-audio diagnostics, qualitative comparisons, and human preference. Experiments show that separating denoising-time speaker memory from conservative post-generation moment correction improves unseen-speaker personalization over collapsed style codes, full-reference attention, and one-clip finetuning. Project: https://xiangyue-zhang.github.io/PersonaGesture.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper gives a concrete single-reference method for unseen-speaker gesture personalization using zero-init style injection plus post-hoc moment correction, but the claimed clean separation from reference trajectories rests on unshown ablations.

read the letter

The core contribution is the pairing of Adaptive Style Infusion (zero-initialized residual cross-attention on compact tokens from a Style Perceiver) with Implicit Distribution Rectification (length-aware diagonal affine correction on channel moments). This lets the model keep a pretrained speech-to-motion diffusion prior while adapting to one variable-length reference clip from an unseen speaker. The approach avoids per-speaker fine-tuning or full attention over the reference, which is a practical step for avatar pipelines. Experiments on BEAT2 and ZeroEGGS report gains on quantitative metrics, identity controls, and human preference over collapsed style codes, full-reference attention, and one-clip tuning. That is the actual new piece and where the work earns credit: it ships two specific, lightweight modules and shows they move the numbers in the right direction for this narrow setting. The soft spot is the entanglement risk. The reference motion itself comes from some utterance, so its pose sequence mixes speaker habits with content-specific timing. ASI has no contrastive term, content mask, or alignment step to isolate the stable part, and IDR only adjusts post-generation statistics in latent space. Without ablations that measure how much of the output is direct copying versus true style transfer, the reported improvements could partly reflect leakage rather than the intended separation. The abstract and stress-test note both flag this, and the paper would need to address it with clearer diagnostics or controls to make the central claim solid. This is for researchers already working on co-speech gesture or avatar animation who need a lightweight personalization route. It is not a broad advance in diffusion or generative modeling. The architecture is clear enough and the results are concrete enough that a serious editor should send it to review rather than desk-reject; the reviewers can press on the disentanglement evidence and ask for the missing equations and ablations. I would bring it to a reading group as a practical case study but would not cite it in my own work unless the separation holds up under closer inspection.

Referee Report

3 major / 3 minor

Summary. The paper introduces PersonaGesture, a diffusion-based framework for single-reference co-speech gesture personalization to unseen speakers. Given target speech and one variable-length motion clip from a new speaker, it synthesizes gestures that follow the utterance while preserving speaker-specific pose habits. The method uses a Style Perceiver to encode the reference into compact tokens, Adaptive Style Infusion (ASI) via zero-initialized residual cross-attention during denoising to inject style without overwriting the pretrained speech-to-motion prior, and Implicit Distribution Rectification (IDR) as a post-generation length-aware diagonal affine correction on latent channel moments. Experiments on BEAT2 and ZeroEGGS report gains in quantitative metrics, reference-identity controls, same-audio diagnostics, and human preferences over collapsed style codes, full-reference attention, and one-clip finetuning.

Significance. If the ASI+IDR separation of identity evidence from residual statistics holds, the approach offers a practical advance for avatar and virtual-agent animation by enabling efficient personalization without per-speaker optimization or full fine-tuning. The design of injecting style tokens only through zero-init attention while keeping the pretrained prior dominant, combined with conservative post-hoc moment correction, is a clear strength that could generalize to other diffusion-based motion tasks. The reproducible evaluation protocol across two datasets with multiple controls strengthens the contribution.

major comments (3)

[§3.2] §3.2 (ASI): The zero-initialized residual cross-attention is intended to let style tokens affect motion formation without replacing the prior, but the Style Perceiver lacks any explicit disentanglement mechanism (contrastive loss, content masking, or temporal alignment) to isolate stable speaker pose habits from the reference clip's own utterance-specific trajectories. This risks the injected tokens carrying content-conditioned information, undermining the central claim that ASI cleanly separates identity evidence.
[§3.3] §3.3 (IDR): The length-aware diagonal affine map corrects channel-wise moments estimated from the reference, but because it operates only post-generation on latent statistics, it leaves temporal trajectory contamination from the reference unaddressed during denoising. The paper should provide the explicit equations for the affine parameters and an ablation showing that IDR does not simply copy reference statistics.
[§4] §4 (Experiments): The reported gains over one-clip finetuning and full-reference attention rest on the assumption that the single reference supplies stable speaker-specific choices separable from content; however, no ablation varies the reference content while holding speaker fixed, nor are error bars or statistical significance provided for the quantitative metrics. This weakens the evidence that improvements stem from true personalization rather than partial reference copying.

minor comments (3)

[Abstract, §1] The abstract and §1 claim 'parameter-free' separation, but the Style Perceiver output tokens and length-aware affine parameters are learned; clarify the exact sense in which the method is parameter-free after pretraining.
[Figure 2] Figure 2 (pipeline diagram) would benefit from explicit annotation of the zero-init cross-attention layers and the IDR affine application point to improve readability.
[§2] Missing reference to prior work on style disentanglement in motion diffusion models (e.g., recent contrastive or masked approaches) in the related-work section.

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for their constructive comments on our work. We address each of the major comments point by point below, indicating where revisions will be made to the manuscript.

read point-by-point responses

Referee: [§3.2] §3.2 (ASI): The zero-initialized residual cross-attention is intended to let style tokens affect motion formation without replacing the prior, but the Style Perceiver lacks any explicit disentanglement mechanism (contrastive loss, content masking, or temporal alignment) to isolate stable speaker pose habits from the reference clip's own utterance-specific trajectories. This risks the injected tokens carrying content-conditioned information, undermining the central claim that ASI cleanly separates identity evidence.

Authors: The Style Perceiver is designed to extract compact tokens from the reference clip, and the zero-initialized residual cross-attention ensures that style infusion occurs gradually without disrupting the pretrained prior. While no explicit disentanglement loss is used, the training objective and the separation of ASI from IDR encourage the model to focus on stable pose habits. Our same-audio diagnostics show that gestures match the reference style without replicating the reference's specific trajectories. We will revise §3.2 to include a more detailed explanation of this implicit mechanism and discuss potential limitations. revision: partial
Referee: [§3.3] §3.3 (IDR): The length-aware diagonal affine map corrects channel-wise moments estimated from the reference, but because it operates only post-generation on latent statistics, it leaves temporal trajectory contamination from the reference unaddressed during denoising. The paper should provide the explicit equations for the affine parameters and an ablation showing that IDR does not simply copy reference statistics.

Authors: We agree that the explicit equations for the affine parameters should be included for reproducibility. The length-aware diagonal affine correction adjusts the mean and variance of each latent channel based on the reference's statistics, with scaling by the length ratio between reference and target. We will add these equations to §3.3. Additionally, we will include an ablation study demonstrating that IDR provides corrections beyond simple copying by comparing variants with and without IDR on metrics and qualitative results. revision: yes
Referee: [§4] §4 (Experiments): The reported gains over one-clip finetuning and full-reference attention rest on the assumption that the single reference supplies stable speaker-specific choices separable from content; however, no ablation varies the reference content while holding speaker fixed, nor are error bars or statistical significance provided for the quantitative metrics. This weakens the evidence that improvements stem from true personalization rather than partial reference copying.

Authors: Our evaluation protocol includes reference-identity controls and same-audio diagnostics to verify that the personalization preserves speaker-specific pose habits independent of the reference's content. We agree that error bars and statistical significance would strengthen the quantitative results. We will add error bars and report p-values from statistical tests in the revised §4. However, performing a new ablation that varies reference content while holding the speaker fixed is not feasible with the current datasets without additional data collection. revision: partial

standing simulated objections not resolved

Ablation varying the reference content while holding speaker fixed, due to dataset limitations preventing such controlled variations without new experiments.

Circularity Check

0 steps flagged

No circularity: empirical pipeline on external pretrained prior

full rationale

The paper describes a diffusion pipeline that encodes a single reference motion clip into style tokens via a Style Perceiver, injects them via zero-init cross-attention (ASI) during denoising, and applies a post-hoc length-aware affine correction (IDR) on channel moments. All claims of improvement are supported by direct experimental comparisons on BEAT2 and ZeroEGGS against explicit baselines (collapsed codes, full-reference attention, one-clip finetuning). No equation, loss term, or uniqueness argument reduces the synthesized output to a quantity defined by the same reference by construction; the pretrained speech-to-motion prior is used as an independent starting point whose parameters are not refit inside the paper.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axioms · 0 invented entities

The central claim rests on a pretrained diffusion prior plus two new modules whose parameters are learned from data; the key domain assumption is that reference clips contain separable stable identity evidence.

free parameters (2)

Style Perceiver output tokens
Compact speaker-memory tokens extracted from the variable-length reference clip; their dimensionality and encoding are learned.
Length-aware diagonal affine parameters
Scale and shift values for the post-generation moment correction estimated from the same reference.

axioms (1)

domain assumption The reference motion clip mixes stable speaker habits with utterance-specific trajectories that can be separated by the proposed modules.
Explicitly stated in the abstract as the core difficulty the method addresses.

pith-pipeline@v0.9.0 · 5569 in / 1314 out tokens · 64278 ms · 2026-05-08T14:18:37.624945+00:00 · methodology

PersonaGesture: Single-Reference Co-Speech Gesture Personalization for Unseen Speakers

Core claim

Load-bearing premise

discussion (0)