
arxiv: 2606.21234 · v1 · pith:NH3GT6NLnew · submitted 2026-06-19 · 💻 cs.CV

Context-Aware Autoregressive Diffusion for Gloss-Wise Sign Language Production

Pith reviewed 2026-06-26 14:17 UTC · model grok-4.3

classification 💻 cs.CV
keywords sign language production · autoregressive diffusion · gloss-wise generation · coarticulation modeling · context-aware generation · motion continuity · diffusion models for sequences

The pith

Gloss-wise autoregressive diffusion produces more accurate and natural sign language sequences by modeling coarticulation through context conditioning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Current sign language production methods often generate entire sentences at once, which causes temporal drift, motion blur, and loss of control over individual glosses as sentences lengthen. The paper introduces GARD, a gloss-wise autoregressive diffusion approach that generates one gloss at a time while conditioning on both semantic meaning and kinematic motion contexts. This setup allows better modeling of how signs influence each other in sequence. Additional components align transitions between glosses using gradient guidance and refine the overall motion for consistency. If effective, this would yield sign language output that is both more linguistically precise and closer to natural human motion on benchmark datasets.
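The gloss-by-gloss loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `toy_denoiser`, `sample_gloss`, `generate_sentence`, the linear noise schedule, and the fixed frame count per gloss are all hypothetical, and the sampler is plain DDPM ancestral sampling in the style of [20]. The point is only that each gloss is sampled separately and receives the previous gloss word and previous motion as context.

```python
import numpy as np

def make_schedule(num_steps=50, beta_start=1e-4, beta_end=0.02):
    """Linear DDPM noise schedule (an assumption; the page gives no schedule)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    return betas, alphas, alpha_bars

def toy_denoiser(x_t, t, gloss, prev_gloss, prev_motion):
    """Hypothetical stand-in for the trained noise predictor.
    A real model would use all of its arguments; this one predicts zero noise."""
    return np.zeros_like(x_t)

def sample_gloss(denoiser, gloss, prev_gloss, prev_motion, frames, dims, schedule, rng):
    """DDPM ancestral sampling for one gloss, conditioned on the previous gloss."""
    betas, alphas, alpha_bars = schedule
    x = rng.standard_normal((frames, dims))
    for t in reversed(range(len(betas))):
        eps = denoiser(x, t, gloss, prev_gloss, prev_motion)
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        noise = rng.standard_normal(x.shape) if t > 0 else 0.0
        x = mean + np.sqrt(betas[t]) * noise
    return x

def generate_sentence(glosses, denoiser=toy_denoiser, frames=8, dims=6,
                      num_steps=50, seed=0):
    """Generate one motion clip per gloss, feeding each result forward as context."""
    rng = np.random.default_rng(seed)
    schedule = make_schedule(num_steps)
    motions, prev_gloss, prev_motion = [], None, None
    for gloss in glosses:
        motion = sample_gloss(denoiser, gloss, prev_gloss, prev_motion,
                              frames, dims, schedule, rng)
        motions.append(motion)
        prev_gloss, prev_motion = gloss, motion  # semantic and kinematic context
    return np.concatenate(motions, axis=0)

sentence = generate_sentence(["MORGEN", "DONN"])
print(sentence.shape)
```

Because each gloss is a separate sampling call, a single gloss can be regenerated or replaced without resampling the whole sentence, which is the control property the summary attributes to gloss-wise generation.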

Core claim

The central claim is that the Context-aware Gloss-wise AutoRegressive Diffusion model (GARD) achieves superior performance over existing SLP methods in linguistic accuracy and motion similarity on the Phoenix-T and CSL-Daily datasets. It does so by modeling coarticulation via conditioning on semantic and kinematic contexts together with Inter-Gloss Transition Guidance and the Global Motion Harmonizer to ensure seamless pose consistency and natural continuity between glosses.

What carries the argument

GARD, the gloss-wise autoregressive diffusion framework that conditions generation on semantic and kinematic contexts to model coarticulation, supported by gradient-based Inter-Gloss Transition Guidance and Global Motion Harmonizer.
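To make the idea of gradient-based boundary guidance concrete, here is a small sketch. It assumes the guidance objective is the squared distance between the first pose of the current gloss and the last pose of the previous gloss (the kinematic hint); the page does not give the paper's exact objective or step size, so `boundary_loss`, `boundary_grad`, `guided_update`, and `scale` are illustrative names and values only.

```python
import numpy as np

def boundary_loss(motion, k_hint):
    """Squared distance between the first generated pose and the kinematic hint."""
    return float(np.sum((motion[0] - k_hint) ** 2))

def boundary_grad(motion, k_hint):
    """Analytic gradient of boundary_loss with respect to the whole motion clip.
    Only the first frame contributes, so every other row is zero."""
    grad = np.zeros_like(motion)
    grad[0] = 2.0 * (motion[0] - k_hint)
    return grad

def guided_update(motion, k_hint, scale=0.25):
    """One guidance step: move the sample down the boundary-loss gradient."""
    return motion - scale * boundary_grad(motion, k_hint)

rng = np.random.default_rng(0)
prev_last_pose = np.zeros(3)              # hypothetical kinematic hint
current = rng.standard_normal((5, 3))     # hypothetical partially denoised clip
updated = guided_update(current, prev_last_pose)
print(boundary_loss(current, prev_last_pose), boundary_loss(updated, prev_last_pose))
```

In a guided diffusion sampler, an update of this kind would typically be applied to the predicted mean at each reverse step rather than once at the end. Since this toy objective only touches the first frame, a separate refinement pass over the rest of the clip, which is the role the summary assigns to the Global Motion Harmonizer, would still be needed to spread the correction.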

If this is right

  • Individual glosses can be controlled more accurately in generated sign language sentences.
  • Longer sign language sequences suffer less from temporal drift and hand motion blur.
  • Both linguistic accuracy and motion similarity metrics improve compared to end-to-end generation methods.
  • Natural continuity between glosses is achieved through boundary alignment and sequence refinement.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This gloss-wise strategy may extend to other autoregressive generation tasks in animation or robotics where unit-level control improves overall coherence.
  • Testing on additional sign language datasets could reveal whether the context conditioning generalizes across different languages and dialects.
  • Combining this diffusion approach with real-time input might enable more responsive sign language translation systems.

Load-bearing premise

That conditioning on semantic and kinematic contexts combined with the Inter-Gloss Transition Guidance and Global Motion Harmonizer produces seamless pose consistency and natural continuity between glosses without introducing new artifacts or reducing linguistic fidelity.

What would settle it

A direct comparison on the Phoenix-T and CSL-Daily datasets where GARD fails to outperform existing methods in both linguistic accuracy and motion similarity metrics, or where generated motions show visible inconsistencies at gloss boundaries.
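One simple way to quantify "visible inconsistencies at gloss boundaries" is to compare frame-to-frame displacement at the boundaries with displacement elsewhere. The sketch below is an illustrative diagnostic, not a metric reported by the paper; `boundary_jump_ratio` and its inputs are hypothetical, and it assumes the clip is not completely static away from the boundaries.

```python
import numpy as np

def boundary_jump_ratio(motion, boundaries):
    """Mean displacement at gloss boundaries divided by mean displacement elsewhere.

    motion:     array of shape (frames, dims)
    boundaries: indices of the first frame of each gloss after the first one
    """
    steps = np.linalg.norm(np.diff(motion, axis=0), axis=1)  # steps[i] = |x[i+1] - x[i]|
    is_boundary = np.zeros(len(steps), dtype=bool)
    is_boundary[[b - 1 for b in boundaries]] = True           # the step into frame b
    return float(steps[is_boundary].mean() / steps[~is_boundary].mean())

smooth = np.arange(10, dtype=float)[:, None] * np.ones((1, 2))
jumpy = smooth.copy()
jumpy[5:] += 3.0                                             # pose gap at frame 5
print(boundary_jump_ratio(smooth, [5]), boundary_jump_ratio(jumpy, [5]))
```

A ratio near 1 means boundary transitions move about as much as ordinary frames; a ratio well above 1 points to a pose gap between consecutive glosses.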

Figures

Figures reproduced from arXiv: 2606.21234 by Boeun Kim, Changho Kim, Chu Xin, Hyung Jin Chang, JungHoon Sung, Sang-Il Choi, Younggeun Choi.

Figure 1: Overview of sign language production process by GARD. (a) Ground-truth 3D mesh motions for two consecutive glosses. (b) Generation process of the gloss motion “MORGEN”. In the next step, GARD will autoregressively generate “DONN” using semantic and kinematic cues from “MORGEN”.
… prediction errors, which progressively degrade motion quality toward the end of long sequences [14,24,56]. Furthermore, this appro…

Figure 2: Overall framework of GARD. (Left) Forward and reverse diffusion processes for the n-th gloss. (Right) During the reverse process, the denoiser ϵθ predicts the noise ϵ at timestep t, conditioned on a set of context vectors. The previous gloss word gn−1 and its corresponding motion m0 n−1 are utilized as semantic and kinematic contexts to condition the generation of the current gloss motion. IG-Guidance ref…

Figure 3: Illustration of Our GARD’s Gloss Motion Harmonizer.
… 3.4 Inter-Gloss Transition Guidance. Despite conditioning on the previous motion context, the pose gap at the boundary between consecutive generated glosses is not fully eliminated. To address this issue, we introduce IG-Guidance to ensure a natural transition between the last pose of the previous gloss gn−1 (the kinematic hint Khint) and the first pose o…

Figure 4: Visual comparisons with SOTA methods SOKE [63] and S-MotionGPT [25].

Figure 5: Qualitative ablation results for IG-Guidance and GM-Harmonizer.
… base model successfully generates the basic motion for each gloss, it fails to produce a proper transition between consecutive glosses. In contrast, the model with GM-Harmonizer applied begins the next sign motion from a state (e.g., hand shape and body position) that is noticeably closer to the previous gloss. This confirms that it improves te…
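For readers unfamiliar with the terms in the Figure 2 caption, the forward process and noise-prediction objective of a standard DDPM [20] can be written as below. This is the textbook formulation, applied to the n-th gloss motion under the assumption that GARD follows it; the symbols $\beta_s$, $\bar\alpha_t$, and the context vector $c_n$ (standing in for the semantic and kinematic contexts) are notation introduced here, not taken from the paper.

```latex
\begin{align}
  q\!\left(m_n^{t} \mid m_n^{0}\right)
    &= \mathcal{N}\!\left(m_n^{t};\ \sqrt{\bar\alpha_t}\, m_n^{0},\ (1-\bar\alpha_t)\, I\right),
    \qquad \bar\alpha_t = \prod_{s=1}^{t} (1-\beta_s), \\
  m_n^{t}
    &= \sqrt{\bar\alpha_t}\, m_n^{0} + \sqrt{1-\bar\alpha_t}\,\epsilon,
    \qquad \epsilon \sim \mathcal{N}(0, I), \\
  \mathcal{L}
    &= \mathbb{E}_{m_n^{0},\, t,\, \epsilon}
       \left\| \epsilon - \epsilon_\theta\!\left(m_n^{t},\, t,\, c_n\right) \right\|^{2}.
\end{align}
```

Here $m_n^{0}$ is the clean motion of the n-th gloss and $m_n^{t}$ its noised version at timestep $t$, matching the caption's use of $m^{0}_{n-1}$ for the previous gloss's clean motion.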

To generate natural and accurate sentence-level sign language, synthesizing the "gloss", the fundamental semantic unit, is essential. However, most current sign-language production (SLP) methods generate entire sequences at once. While this end-to-end approach is often efficient, it is prone to temporal drift and hand motion blur as sentences get longer, and fails to accurately control individual glosses. In this paper, we propose the Context-aware Gloss-wise AutoRegressive Diffusion model (GARD), a gloss-wise diffusion framework that models coarticulation by conditioning on both semantic (linguistic) and kinematic (motion) contexts. To ensure natural continuity between gloss motions, GARD introduces two additional strategies: i) Inter-Gloss Transition Guidance, which applies gradient-based guidance to kinematically align inter-gloss boundaries and ensure seamless pose consistency. ii) Global Motion Harmonizer, refining the entire gloss motion sequence based on the boundary poses adjusted by Inter-Gloss Transition Guidance. Extensive experiments on Phoenix-T and CSL-Daily datasets demonstrate that GARD achieves superior performance over existing SLP methods in terms of both linguistic accuracy and motion similarity.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper proposes the Context-aware Gloss-wise AutoRegressive Diffusion model (GARD) for sign language production (SLP). It generates sentence-level signs gloss by gloss via an autoregressive diffusion process conditioned on both semantic (linguistic) and kinematic (motion) contexts to model coarticulation. Two additional components are introduced: Inter-Gloss Transition Guidance, which uses gradient-based guidance to kinematically align inter-gloss boundaries, and a Global Motion Harmonizer that refines the full sequence based on the adjusted boundary poses. The central claim is that extensive experiments on the Phoenix-T and CSL-Daily datasets show GARD achieves superior performance over existing SLP methods in both linguistic accuracy and motion similarity.

Significance. If the empirical results hold, the work could advance SLP by offering improved control over individual glosses and better handling of coarticulation and temporal drift in longer sequences through a gloss-wise diffusion approach with dual-context conditioning and guidance mechanisms. This addresses known limitations of end-to-end sequence generation methods.

major comments (1)
  1. [Abstract] The abstract asserts that GARD achieves superior performance over existing SLP methods on Phoenix-T and CSL-Daily in terms of linguistic accuracy and motion similarity, but supplies no quantitative metrics, baseline comparisons, ablation results, or error analysis. This prevents verification of whether the data supports the central claim.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their review. We address the single major comment below.

  1. Referee: [Abstract] The abstract asserts that GARD achieves superior performance over existing SLP methods on Phoenix-T and CSL-Daily in terms of linguistic accuracy and motion similarity, but supplies no quantitative metrics, baseline comparisons, ablation results, or error analysis. This prevents verification of whether the data supports the central claim.

    Authors: The referee is correct that the abstract states the performance claim at a high level without embedding specific numbers or comparisons. The manuscript body (Sections 4–5) contains the full quantitative results, baselines, ablations, and error analysis on both datasets. To improve verifiability directly from the abstract, we will revise it to include the key reported metrics (e.g., specific gains in BLEU, DTW, and motion similarity scores) while preserving its concise nature. revision: yes
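Since the rebuttal names DTW among the metrics, a minimal sketch of dynamic time warping [5] between two pose sequences may help. How the paper computes its DTW score (which joints, what normalization) is not stated on this page, so `dtw_distance` and the Euclidean per-frame cost are assumptions made for illustration.

```python
import numpy as np

def dtw_distance(a, b):
    """Classic dynamic time warping cost between two sequences of pose vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])   # per-frame Euclidean cost
            # Extend the cheapest of: insertion, deletion, or match.
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return float(cost[n, m])

reference = [[0.0], [1.0], [2.0]]
slower = [[0.0], [1.0], [1.0], [2.0]]   # same motion with one held frame
print(dtw_distance(reference, slower))
```

Because the alignment may repeat frames, a generated sequence that performs the same motion at a different speed is not penalized, which is why DTW is commonly used to compare motion sequences of different lengths.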

Circularity Check

0 steps flagged

No significant circularity; empirical model proposal with external evaluation


The paper introduces GARD as a novel gloss-wise autoregressive diffusion framework with semantic/kinematic conditioning, Inter-Gloss Transition Guidance, and Global Motion Harmonizer. These are architectural choices and training strategies presented as new contributions, then evaluated for superiority on external benchmark datasets (Phoenix-T, CSL-Daily) using standard linguistic and motion metrics. No derivation reduces a claimed prediction to an input by construction, no parameter is fitted then renamed as a prediction, and no load-bearing premise rests on a self-citation chain. The central claims are empirical performance results rather than tautological or self-referential mathematics, making the work self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 0 axioms · 0 invented entities

Only the abstract is available, so the ledger is necessarily incomplete; the model is expected to rest on standard diffusion model assumptions and computer vision training practices whose specific free parameters are not disclosed.

free parameters (1)
  • diffusion and autoregressive training hyperparameters
    Typical in such models but unspecified in the abstract


Reference graph

Works this paper leans on

64 extracted references · 9 canonical work pages · 3 internal anchors

  1. [1] Arkushin, R.S., Moryossef, A., Fried, O.: Ham2pose: Animating sign language notation into pose sequences. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 21046–21056 (2023)

  2. [2] Baltatzis, V., Potamias, R.A., Ververas, E., Sun, G., Deng, J., Zafeiriou, S.: Neural sign actors: A diffusion model for 3d sign language production from text. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 1985–1995 (June 2024)

  3. [3] Bao, F., Nie, S., Xue, K., Cao, Y., Li, C., Su, H., Zhu, J.: All are worth words: A vit backbone for diffusion models. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 22669–22679 (2023)

  4. [4] Bengio, S., Vinyals, O., Jaitly, N., Shazeer, N.: Scheduled sampling for sequence prediction with recurrent neural networks. Advances in neural information processing systems 28 (2015)

  5. [5] Berndt, D.J., Clifford, J.: Using dynamic time warping to find patterns in time series. In: Proceedings of the 3rd international conference on knowledge discovery and data mining. pp. 359–370 (1994)

  6. [6] Bragg, D., Koller, O., Bellard, M., Berke, L., Boudreault, P., Braffort, A., Caselli, N., Huenerfauth, M., Kacorri, H., Verhoef, T., et al.: Sign language recognition, generation, and translation: An interdisciplinary perspective. In: Proceedings of the 21st international ACM SIGACCESS conference on computers and accessibility. pp. 16–31 (2019)

  7. [7] Camgoz, N.C., Hadfield, S., Koller, O., Ney, H., Bowden, R.: Neural sign language translation. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 7784–7793 (2018)

  8. [8] Channer, C.S.: Coarticulation in american sign language fingerspelling (2012)

  9. [9] Chen, X., Jiang, B., Liu, W., Huang, Z., Fu, B., Chen, T., Yu, G.: Executing your commands via motion diffusion in latent space. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 18000–18010 (2023)

  10. [10] Chen, Y., Wei, F., Sun, X., Wu, Z., Lin, S.: A simple multi-modality transfer learning baseline for sign language translation (2023), https://arxiv.org/abs/2203.04287

  11. [11] Chen, Y., Zuo, R., Wei, F., Wu, Y., Liu, S., Mak, B.: Two-stream network for sign language recognition and translation. Advances in Neural Information Processing Systems 35, 17043–17056 (2022)

  12. [12] Dong, L., Chaudhary, L., Xu, F., Wang, X., Lary, M., Nwogu, I.: Signavatar: Sign language 3d motion reconstruction and generation. In: 2024 IEEE 18th International Conference on Automatic Face and Gesture Recognition (FG). pp. 1–10. IEEE (2024)

  13. [13] Dong, L., Wang, X., Nwogu, I.: Word-conditioned 3D American Sign Language motion generation. In: Al-Onaizan, Y., Bansal, M., Chen, Y.N. (eds.) Findings of the Association for Computational Linguistics: EMNLP 2024. pp. 9993–9999. Association for Computational Linguistics, Miami, Florida, USA (Nov 2024). https://doi.org/10.18653/v1/2024.findings-emnlp.584

  14. [14] Fang, S., Sui, C., Zhou, Y., Zhang, X., Zhong, H., Zhao, M., Tian, Y., Chen, C.: Signdiff: Diffusion models for american sign language production. arXiv preprint arXiv:2308.16082 (2023)

  15. [15] Geng, Z., Han, C., Hayder, Z., Liu, J., Shah, M., Mian, A.: Text-guided 3d human motion generation with keyframe-based parallel skip transformer. arXiv preprint arXiv:2405.15439 (2024)

  16. [16] Ghorbani, S., Wloka, C., Etemad, A., Brubaker, M.A., Troje, N.F.: Probabilistic character motion synthesis using a hierarchical deep latent variable model. In: Computer Graphics Forum. vol. 39, pp. 225–239. Wiley Online Library (2020)

  17. [17] Gong, J., Foo, L.G., He, Y., Rahmani, H., Liu, J.: Llms are good sign language translators. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 18362–18372 (2024)

  18. [18] Guo, C., Zou, S., Zuo, X., Wang, S., Ji, W., Li, X., Cheng, L.: Generating diverse and natural 3d human motions from text. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 5152–5161 (June 2022)

  19. [19] Guo, C., Zuo, X., Wang, S., Zou, S., Sun, Q., Deng, A., Gong, M., Cheng, L.: Action2motion: Conditioned generation of 3d human motions. In: Proceedings of the 28th ACM international conference on multimedia. pp. 2021–2029 (2020)

  20. [20] Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. Advances in neural information processing systems 33, 6840–6851 (2020)

  21. [21] Ho, J., Salimans, T.: Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598 (2022)

  22. [22] Huang, W., Pan, W., Zhao, Z., Tian, Q.: Towards fast and high-quality sign language production. In: Proceedings of the 29th ACM International Conference on Multimedia. pp. 3172–3181 (2021)

  23. [23] Hwang, E.J., Kim, J.H., Park, J.C.: Non-autoregressive sign language production with gaussian space. In: BMVC. vol. 1, p. 3 (2021)

  24. [24] Hwang, E.J., Lee, H., Park, J.C.: A gloss-free sign language production with discrete representation. In: 2024 IEEE 18th International Conference on Automatic Face and Gesture Recognition (FG). pp. 1–6. IEEE (2024)

  25. [25] Jiang, B., Chen, X., Liu, W., Yu, J., Yu, G., Chen, T.: Motiongpt: Human motion as a foreign language. Advances in Neural Information Processing Systems 36, 20067–20079 (2023)

  26. [26] Jiao, P., Min, Y., Li, Y., Wang, X., Lei, L., Chen, X.: Cosign: Exploring co-occurrence signals in skeleton-based continuous sign language recognition. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 20676–20686 (2023)

  27. [27] Karunratanakul, K., Preechakul, K., Suwajanakorn, S., Tang, S.: Guided motion diffusion for controllable human motion synthesis. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 2151–2162 (2023)

  28. [28] Lin, C.Y.: Rouge: A package for automatic evaluation of summaries. In: Text summarization branches out. pp. 74–81 (2004)

  29. [29] Liu, Y., Gu, J., Goyal, N., Li, X., Edunov, S., Ghazvininejad, M., Lewis, M., Zettlemoyer, L.: Multilingual denoising pre-training for neural machine translation. Transactions of the Association for Computational Linguistics 8, 726–742 (2020)

  30. [30] Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)

  31. [31] Manning, V., Murray, J.J., Bloxs, A.: Linguistic human rights in the work of the world federation of the deaf. The handbook of linguistic human rights, pp. 267–280 (2022)

  32. [32] Mertz, J., Pagel, L., Perniss, P., Turco, G., Mücke, D.: Coarticulation in sign language: A kinematic study on french sign language (lsf) using electromagnetic articulography (ema). In: ISSP 2024-13th International Seminar on Speech Production. pp. 51–54. ISCA (2024)

  33. [33] Murray, J.: World federation of the deaf (2020)

  34. [34] Papineni, K., Roukos, S., Ward, T., Zhu, W.J.: Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics. pp. 311–318 (2002)

  35. [35] Pavlakos, G., Choutas, V., Ghorbani, N., Bolkart, T., Osman, A.A., Tzionas, D., Black, M.J.: Expressive body capture: 3d hands, face, and body from a single image. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 10975–10985 (2019)

  36. [36] Petrovich, M., Black, M.J., Varol, G.: Temos: Generating diverse human motions from textual descriptions. In: European conference on computer vision. pp. 480–

  37. [37] Ronneberger, O., Fischer, P., Brox, T.: U-net: Convolutional networks for biomedical image segmentation. In: International Conference on Medical image computing and computer-assisted intervention. pp. 234–241. Springer (2015)

  38. [38] Saunders, B., Camgoz, N.C., Bowden, R.: Progressive transformers for end-to-end sign language production. In: European Conference on Computer Vision. pp. 687–

  39. [39] Saunders, B., Camgoz, N.C., Bowden, R.: Mixed signals: Sign language production via a mixture of motion primitives. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 1919–1929 (2021)

  40. [40] Saunders, B., Camgoz, N.C., Bowden, R.: Signing at scale: Learning to co-articulate signs for large-scale photo-realistic sign language production. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 5141–5151 (2022)

  41. [41] Segouat, J.: A study of sign language coarticulation. ACM Sigaccess Accessibility and Computing (93), 31–38 (2009)

  42. [42] Shan, W., Liu, Z., Zhang, X., Wang, Z., Han, K., Wang, S., Ma, S., Gao, W.: Diffusion-based 3d human pose estimation with multi-hypothesis aggregation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 14761–14771 (2023)

  43. [43] Stokoe Jr, W.C.: Sign language structure: An outline of the visual communication systems of the american deaf. Journal of deaf studies and deaf education 10(1), 3–37 (2005)

  44. [45] Stoll, S., Mustafa, A., Guillemaut, J.Y.: There and back again: 3d sign language generation from text using back-translation. In: 2022 International Conference on 3D Vision (3DV). pp. 187–196. IEEE (2022)

  45. [46] Tang, S., He, J., Cheng, L., Wu, J., Guo, D., Hong, R.: Discrete to continuous: Generating smooth transition poses from sign language observations. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 3481–3491 (2025)

  46. [47] Tang, S., He, J., Guo, D., Wei, Y., Li, F., Hong, R.: Sign-idd: Iconicity disentangled diffusion for sign language production. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 39, pp. 7266–7274 (2025)

  47. [48] Tang, S., Hong, R., Guo, D., Wang, M.: Gloss semantic-enhanced network with online back-translation for sign language production. In: ACM International Conference on Multimedia (ACM MM) (Oct 2022)

  48. [49] Tang, S., Xue, F., Wu, J., Wang, S., Hong, R.: Gloss-driven conditional diffusion models for sign language production. ACM Transactions on Multimedia Computing, Communications, and Applications (2024)

  49. [50] Tevet, G., Raab, S., Gordon, B., Shafir, Y., Cohen-Or, D., Bermano, A.H.: Human motion diffusion model. arXiv preprint arXiv:2209.14916 (2022)

  50. [51] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017)

  51. [52] Viegas, C., Inan, M., Quandt, L., Alikhani, M.: Including facial expressions in contextual embeddings for sign language generation. arXiv preprint arXiv:2202.05383 (2022)

  52. [53] Xie, P., Peng, T., Du, Y., Zhang, Q.: Sign language production with latent motion transformer. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. pp. 3024–3034 (2024)

  53. [54] Xie, P., Zhang, Q., Taiying, P., Tang, H., Du, Y., Li, Z.: G2p-ddm: Generating sign pose sequence from gloss sequence with discrete diffusion model. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 38, pp. 6234–6242 (2024)

  54. [55] Xie, Y., Jampani, V., Zhong, L., Sun, D., Jiang, H.: Omnicontrol: Control any joint at any time for human motion generation. In: The Twelfth International Conference on Learning Representations (2024), https://openreview.net/forum?id=gd0lAEtWso

  55. [56] Yin, A., Li, H., Shen, K., Tang, S., Zhuang, Y.: T2s-gpt: Dynamic vector quantization for autoregressive sign language production from text. arXiv preprint arXiv:2406.07119 (2024)

  56. [57] Yin, K., Moryossef, A., Hochgesang, J., Goldberg, Y., Alikhani, M.: Including signed languages in natural language processing. In: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). pp. 7347–7360 (2021)

  57. [58] Yu, Z., Huang, S., Cheng, Y., Birdal, T.: Signavatars: A large-scale 3d sign language holistic motion dataset and benchmark. In: European Conference on Computer Vision. pp. 1–19. Springer (2024)

  58. [59] Zhang, L., Rao, A., Agrawala, M.: Adding conditional control to text-to-image diffusion models. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 3836–3847 (2023)

  59. [60] Zhang, M., Cai, Z., Pan, L., Hong, F., Guo, X., Yang, L., Liu, Z.: Motiondiffuse: Text-driven human motion generation with diffusion model. IEEE transactions on pattern analysis and machine intelligence 46(6), 4115–4128 (2024)

  60. [61] Zhou, H., Zhou, W., Qi, W., Pu, J., Li, H.: Improving sign language translation with monolingual data by sign back-translation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 1316–1325 (2021)

  61. [62] Zhou, Y., Barnes, C., Lu, J., Yang, J., Li, H.: On the continuity of rotation representations in neural networks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 5745–5753 (2019)

  62. [63] Zuo, R., Potamias, R.A., Ververas, E., Deng, J., Zafeiriou, S.: Signs as tokens: A retrieval-enhanced multilingual sign language generator. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 23806–23816 (2025)

  63. [64] Zuo, R., Wei, F., Chen, Z., Mak, B., Yang, J., Tong, X.: A simple baseline for spoken language to sign language translation with 3d avatars. In: European Conference on Computer Vision. pp. 36–54. Springer (2024)

  64. [65] Zuo, R., Wei, F., Mak, B.: Towards online continuous sign language recognition and translation. In: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. pp. 11050–11067 (2024)