Gelina: Unified Speech and Gesture Synthesis via Interleaved Token Prediction

2025 · cs.SD · arXiv 2510.12834

1 Pith paper cites this work. Polarity classification is still indexing.


abstract

Human communication is multimodal, with speech and gestures tightly coupled, yet most computational methods for generating speech and gestures synthesize them sequentially, weakening synchrony and prosody alignment. We introduce Gelina, a unified framework that jointly synthesizes speech and co-speech gestures from text using interleaved token sequences in a discrete autoregressive backbone, with modality-specific decoders. Gelina supports multi-speaker and multi-style cloning and enables gesture-only synthesis from speech inputs. Subjective and objective evaluations demonstrate competitive speech quality and improved gesture generation over unimodal baselines.
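
To make the idea of an interleaved token sequence concrete, here is a minimal sketch of how two discrete token streams could be merged into one sequence for a single autoregressive model and then split back apart for modality-specific decoding. The abstract does not specify the interleaving pattern, so everything here is an assumption for illustration: the fixed ratio `speech_per_gesture`, the shared vocabulary with a `gesture_offset`, and the function names `interleave` and `split` are all hypothetical and not taken from the paper.

```python
from typing import List, Tuple


def interleave(
    speech: List[int],
    gesture: List[int],
    speech_per_gesture: int,
    gesture_offset: int,
) -> List[int]:
    """Merge two token streams into one sequence.

    Assumed pattern (hypothetical, not from the paper): emit
    `speech_per_gesture` speech tokens, then one gesture token.
    Gesture ids are shifted by `gesture_offset` so both modalities
    can share one vocabulary without colliding.
    """
    merged: List[int] = []
    pos = 0
    for g in gesture:
        merged.extend(speech[pos:pos + speech_per_gesture])
        pos += speech_per_gesture
        merged.append(g + gesture_offset)
    # Append any speech tokens left after the last gesture token.
    merged.extend(speech[pos:])
    return merged


def split(merged: List[int], gesture_offset: int) -> Tuple[List[int], List[int]]:
    """Recover the per-modality streams so each can go to its own decoder."""
    speech = [t for t in merged if t < gesture_offset]
    gesture = [t - gesture_offset for t in merged if t >= gesture_offset]
    return speech, gesture


if __name__ == "__main__":
    seq = interleave([1, 2, 3, 4, 5], [0, 1], speech_per_gesture=2, gesture_offset=1000)
    print(seq)
    print(split(seq, gesture_offset=1000))
```

Because both modalities sit in one sequence, each predicted token can condition on the recent history of the other modality, which is one plausible reading of how joint generation would help synchrony compared with synthesizing speech and gestures one after the other.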

representative citing papers

Gelina: Unified Speech and Gesture Synthesis via Interleaved Token Prediction

cs.SD · 2025-10-13 · unverdicted · novelty 6.0

A unified discrete autoregressive model for joint text-to-speech and co-speech gesture synthesis via interleaved token sequences and modality-specific decoders.

citing papers explorer

Showing 1 of 1 citing paper.

Gelina: Unified Speech and Gesture Synthesis via Interleaved Token Prediction

cs.SD · 2025-10-13 · unverdicted · none · ref 2 · internal anchor

A unified discrete autoregressive model for joint text-to-speech and co-speech gesture synthesis via interleaved token sequences and modality-specific decoders.