The tokenizers independently convert continuous speech and gestures to discrete indices corresponding to latent codes in a vocabulary.

GELINA ARCHITECTURE

Gelina is a bimodal generative model that has three core components, which are depicted for gestures in Figure 1.

1 Pith paper cites this work. Polarity classification is still indexing.

representative citing papers

Gelina: Unified Speech and Gesture Synthesis via Interleaved Token Prediction

cs.SD · 2025-10-13 · unverdicted · novelty 6.0

A unified discrete autoregressive model for joint text-to-speech and co-speech gesture synthesis via interleaved token sequences and modality-specific decoders.

citing papers explorer

Showing 1 of 1 citing paper.

Gelina: Unified Speech and Gesture Synthesis via Interleaved Token Prediction

cs.SD · 2025-10-13 · unverdicted · none · ref 3
A unified discrete autoregressive model for joint text-to-speech and co-speech gesture synthesis via interleaved token sequences and modality-specific decoders.
