Hybrid Autoregressive-Diffusion Model for Real-Time Sign Language Production
Earlier Sign Language Production (SLP) models typically relied on autoregressive decoding, which naturally preserves temporal causality but suffers from error accumulation at inference time. More recent diffusion-based approaches improve generation quality through iterative denoising, yet their sequence-level refinement process introduces substantial latency. To address this trade-off, we propose HybridSign, a hybrid autoregressive-diffusion model for low-latency sign language production that combines causal frame generation with flow-based diffusion refinement. A Multi-Scale Pose Representation module captures fine-grained articulator features, while a Confidence-Aware Causal Attention mechanism leverages joint-level confidence scores to improve robustness under noisy 2D pose observations. Experiments on PHOENIX14T and How2Sign show that HybridSign consistently achieves the best quality-efficiency trade-off among the compared baselines. On the How2Sign test split, it reaches BLEU-1 and BLEU-4 scores of 30.12 and 6.48 and a DTW of 3.89, while reducing time-to-first-frame to 5.90 s and increasing throughput to 10.17 FPS under a 60-frame evaluation protocol.
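The abstract reports DTW (dynamic time warping) as its pose-quality metric without spelling out how it is computed. As a rough illustration only, the sketch below implements the textbook DTW recurrence over two pose sequences, using Euclidean distance between frames. The function name `dtw_distance`, the flat-list frame layout, and the absence of any normalization are assumptions made for this example; the paper's evaluation protocol may differ (for instance, by normalizing the cost by path length or by comparing only selected joints).

```python
import math


def dtw_distance(seq_a, seq_b):
    """Textbook dynamic time warping cost between two pose sequences.

    Each sequence is a list of frames, and each frame is a flat list of
    floats (e.g. concatenated joint coordinates). This layout is an
    assumption for illustration, not the paper's data format.
    """
    n, m = len(seq_a), len(seq_b)
    # cost[i][j] = minimal accumulated cost of aligning the first i frames
    # of seq_a with the first j frames of seq_b.
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Euclidean distance between the two frames being matched.
            d = math.dist(seq_a[i - 1], seq_b[j - 1])
            # Extend the cheapest of the three allowed predecessor alignments.
            cost[i][j] = d + min(
                cost[i - 1][j],      # seq_a advances, seq_b frame is reused
                cost[i][j - 1],      # seq_b advances, seq_a frame is reused
                cost[i - 1][j - 1],  # both sequences advance
            )
    return cost[n][m]


if __name__ == "__main__":
    reference = [[0.0, 0.0], [1.0, 1.0]]
    # Same motion, but the first pose is held for one extra frame.
    generated = [[0.0, 0.0], [0.0, 0.0], [1.0, 1.0]]
    print(dtw_distance(reference, generated))
```

Because DTW allows one frame to be matched against several, a generated sequence that holds a pose slightly longer than the reference is not penalized for the timing difference alone. Lower values indicate generated poses that track the reference more closely.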
Forward citations
Cited by 1 Pith paper
- Marrying Generative Model of Healthcare Events with Digital Twin of Social Determinants of Health for Disease Reasoning
A generative framework using geometric diffusion for brain networks and tabular diffusion for other organs integrates ICD-coded SDoH proxies to improve disease reasoning on UK Biobank data.