SemConFlow: Semantic Grounding of Holistic Co-Speech Gesture Generation with Contrastive Flow-Matching

Aslı Özyürek; Esam Ghaleb; Lanmiao Liu; Zerrin Yumak

arxiv: 2603.26553 · v2 · pith:FE74MB2Vnew · submitted 2026-03-27 · 💻 cs.CV

Lanmiao Liu, Esam Ghaleb, Aslı Özyürek, Zerrin Yumak

classification 💻 cs.CV

keywords: methods, co-speech, contrastive, generation, gesture, gestures, holistic, motion

While the field of co-speech gesture generation has seen significant advances, producing holistic, semantically grounded gestures remains a challenge. Existing approaches rely on external semantic retrieval methods, whose dependency on predefined linguistic rules limits their generalisation capability. Flow-matching-based methods produce promising results; however, the network is optimised using only semantically congruent samples, without exposure to negative examples, so it tends to learn rhythmic gestures rather than sparse motion such as iconic and metaphoric gestures. Furthermore, by modelling body parts in isolation, the majority of methods fail to maintain cross-modal consistency. We introduce a Contrastive Flow Matching-based co-speech gesture generation model that uses mismatched audio-text conditions as negatives, training the velocity field to follow the correct motion trajectory while repelling semantically incongruent trajectories. Our model ensures cross-modal coherence by embedding text, audio, and holistic motion into a composite latent space via cosine and contrastive objectives. Extensive experiments and a user study demonstrate that our proposed approach outperforms state-of-the-art methods on two datasets, BEAT2 and SHOW.
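
The idea of following the correct trajectory while repelling incongruent ones can be illustrated with a minimal NumPy sketch of a generic contrastive flow-matching loss. This is not the paper's implementation: the abstract does not give the exact objective, so the linear interpolation path, the function names `interpolate` and `contrastive_fm_loss`, the weight `lam`, and the way the negative target is built from a mismatched sample are all assumptions made for illustration.

```python
import numpy as np

def interpolate(x0, x1, t):
    """Point on the straight path from noise x0 to data x1 at time t in [0, 1]."""
    return (1.0 - t) * x0 + t * x1

def contrastive_fm_loss(v_pred, x0, x1, x1_neg, lam=0.05):
    """Generic contrastive flow-matching loss (illustrative assumption).

    v_pred : velocity predicted at interpolate(x0, x1, t) under the matched condition
    x0     : noise sample
    x1     : motion that matches the audio-text condition
    x1_neg : motion taken from a mismatched condition (hypothetical negative)
    lam    : hypothetical weight of the repulsion term
    """
    target_pos = x1 - x0      # velocity of the matched straight path
    target_neg = x1_neg - x0  # velocity toward the mismatched motion
    attract = np.mean((v_pred - target_pos) ** 2)  # pull toward the correct trajectory
    repel = np.mean((v_pred - target_neg) ** 2)    # distance to the incongruent trajectory
    return attract - lam * repel

# Toy usage: a perfect prediction has zero attraction error, so the loss
# is just the negative, weighted distance to the mismatched target.
x0 = np.zeros((2, 3))
x1 = np.ones((2, 3))
x1_neg = 3.0 * np.ones((2, 3))
loss = contrastive_fm_loss(x1 - x0, x0, x1, x1_neg)
```

Setting `lam=0` recovers the plain flow-matching regression loss, which makes the repulsion term easy to ablate in such a sketch.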

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

PersonaGest: Personalized Co-Speech Gesture Generation with Semantic-Guided Hierarchical Motion Representation
cs.GR 2026-05 unverdicted novelty 7.0

PersonaGest uses a semantic-guided RVQ-VAE with a Semantic-Aware Motion Codebook and contrastive learning in stage one, followed by a Masked Generative Transformer and Style Residual Transformers in stage two, to achi...

DuoGesture: Neuro-Inspired and Biomechanically Informed Dual-Stream Co-Speech Gesture Generation
cs.CV 2026-05 unverdicted novelty 5.0

DuoGesture introduces a dual-stream architecture for co-speech gesture generation that decouples semantic and beat streams via a stochastic gate and biomechanical regularization, claiming better performance than holis...