Sequence-to-Sequence Natural Language to Humanoid Robot Sign Language

Bartek Łukawski; Carlos Balaguer; Jennifer J. Gago; Juan G. Victores; Ugo Pattacini; Vadim Tikhanoff; Valentina Vasco

arxiv: 1907.04198 · v1 · pith:HX2MBHFRnew · submitted 2019-07-09 · 💻 cs.RO · cs.CL · cs.HC · cs.LG

Jennifer J. Gago, Valentina Vasco, Bartek Łukawski, Ugo Pattacini, Vadim Tikhanoff, Juan G. Victores, Carlos Balaguer

Pith reviewed 2026-05-25 00:26 UTC · model grok-4.3

keywords: sign language translation, sequence-to-sequence models, humanoid robot, neural networks, OpenPose, Spanish sign language, human-robot interaction, skeleton data acquisition

The pith

Sequence-to-sequence neural networks translate natural language text into sign language movements for the humanoid robot TEO.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a data-driven method to convert Spanish text into sign language gestures executed by the TEO robot. It selects sequence-to-sequence models to manage the mismatch in input and output sequence lengths and to capture non-manual markers without relying on hand-crafted rules. The approach requires collecting training data through OpenPose and skeletonRetriever paired with a 3D sensor, then training the networks so the robot produces the corresponding movements automatically.

Core claim

By training sequence-to-sequence models on skeleton data acquired from human signers, the TEO humanoid robot can convert natural language input into the corresponding Spanish sign language output movements, addressing length discordance and non-manual markers through a data-driven process rather than expert systems.

What carries the argument

Sequence-to-sequence (seq2seq) neural network models that map variable-length text sequences to variable-length movement sequences while incorporating non-manual markers.

If this is right

The TEO robot can produce Spanish sign language output from text input without manual programming of each sign.
Sequence-to-sequence models bypass the complexity limits of traditional rule-based translation systems for sign language.
Human skeleton acquisition via OpenPose supplies the movement data needed to train the translation models.
A 3D sensor study identifies hardware capable of supporting the required data collection for training.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same data-collection and training pipeline could be applied to other sign languages if equivalent skeleton recordings are obtained.
Combining the text-to-movement model with speech-to-text would allow spoken language to be rendered as robot sign language in real time.
Successful deployment would enable direct sign-language interaction between the robot and deaf users without an interpreter.

Load-bearing premise

OpenPose and skeletonRetriever together with a suitable 3D sensor will produce training data of sufficient quality and quantity to train seq2seq models that correctly handle non-manual markers and length mismatches in sign language.

What would settle it

A trained model that generates movement sequences judged incorrect by fluent Spanish sign language users on test sentences containing non-manual markers.

Figures

Figures reproduced from arXiv: 1907.04198 by Bartek Łukawski, Carlos Balaguer, Jennifer J. Gago, Juan G. Victores, Ugo Pattacini, Vadim Tikhanoff, Valentina Vasco.

**Figure 1.** Sequence-to-sequence model: several possible layouts. The main unit of a conventional neural network is the perceptron, of input x and output y = g…

**Figure 2.** The LSTM cell internals. These networks are typically trained in a supervised fashion, providing paired input (x) and output (y) examples. The output predictions (ỹ) are computed via a Forward Propagation process, whereas the weights and bias updates are computed during a Back Propagation Through Time process. Optimization algorithms that are used for conventional neural networks, such as Stochastic Gradi…
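
The forward propagation mentioned in this caption can be illustrated with a minimal sketch of one LSTM step. It assumes the commonly used formulation with input, forget, and output gates and scalar state (hidden size 1) for readability; the paper's exact variant and dimensions are not visible in this excerpt, and the parameter names and values below are hypothetical.

```python
import math
from typing import Dict, List, Tuple

# Each gate holds (weight on input x, weight on previous hidden state h, bias).
Params = Dict[str, Tuple[float, float, float]]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x: float, h_prev: float, c_prev: float, params: Params) -> Tuple[float, float]:
    """One forward step of a scalar LSTM cell; returns (hidden state, cell state)."""
    def gate(name: str, activation) -> float:
        w_x, w_h, b = params[name]
        return activation(w_x * x + w_h * h_prev + b)

    f = gate("forget", sigmoid)        # how much of the old cell state to keep
    i = gate("input", sigmoid)         # how much of the candidate to write
    o = gate("output", sigmoid)        # how much of the cell state to expose
    g = gate("candidate", math.tanh)   # candidate cell content
    c = f * c_prev + i * g
    h = o * math.tanh(c)
    return h, c


def run_sequence(xs: List[float], params: Params) -> List[float]:
    """Forward propagation over a whole input sequence, starting from zero state."""
    h, c = 0.0, 0.0
    outputs = []
    for x in xs:
        h, c = lstm_step(x, h, c, params)
        outputs.append(h)
    return outputs


# Illustrative (untrained) parameters, chosen only so the example runs.
demo_params: Params = {
    "forget": (0.5, 0.1, 0.0),
    "input": (0.4, -0.2, 0.0),
    "output": (0.3, 0.2, 0.0),
    "candidate": (0.9, 0.5, 0.0),
}
hidden_states = run_sequence([1.0, -0.5, 0.25], demo_params)
```

Training would adjust the weights through Back Propagation Through Time, which this sketch deliberately leaves out.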

**Figure 3.** OpenPose 2D skeleton pose estimation. This library achieves high accuracy and performance regardless of the number of people in the image by using a non-parametric representation of 2D vector fields that encode the position and orientation of body parts over the image domain and their degree of association, referred to as Part Affinity Fields (PAFs), in order to learn to relate them to each individual. Ope…

**Figure 4.** … and consists of a set of modules interconnected together on a YARP network

**Figure 5.** Depth filtering. Thereby, the estimate of the 3D keypoint $p_i$ is readily determined by resorting to the projection of the i-th 2D keypoint coordinates $(u, v)_i$ identified by the detector and the relative filtered depth $d_i$, using the classical pinhole camera model characterized by the focal length $f$ and the image width and height $(w, h)$: $$p_i = d_i \begin{pmatrix} \dfrac{u_i - w/2}{f} \\ \dfrac{v_i - h/2}{f} \\ 1 \end{pmatrix} \qquad (7)$$ The resulting 3D estimates …
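
As a worked instance of the pinhole relation in this caption, the following sketch back-projects one 2D keypoint and its filtered depth into a 3D point. The function name and the numeric values are illustrative assumptions, not taken from the paper; the formula assumes the principal point sits at the image centre and the focal length is expressed in pixels.

```python
from typing import Tuple


def backproject_keypoint(u: float, v: float, depth: float,
                         focal: float, width: int, height: int) -> Tuple[float, float, float]:
    """Return p = d * [(u - w/2)/f, (v - h/2)/f, 1] for one keypoint."""
    x = depth * (u - width / 2.0) / focal
    y = depth * (v - height / 2.0) / focal
    z = depth
    return (x, y, z)


# Hypothetical 640x480 image with a 500 px focal length and a keypoint 2 m away.
point = backproject_keypoint(u=420.0, v=240.0, depth=2.0,
                             focal=500.0, width=640, height=480)
print(point)
```

A keypoint at the exact image centre maps to (0, 0, d), which is a quick sanity check when wiring this into a capture pipeline.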

**Figure 6.** Optimization applied to skeleton limbs. Section 4, Experiments: Sequence-to-Sequence Model. The sequence-to-sequence model accepts natural language text sequences as input, and outputs translated LSE token sequences. To achieve this task, both the input and output sequences are tokenized at word-level, reserving additional tokens for exclamation and question marks, and each token is encoded via a one-hot vector. The…
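
The word-level tokenization and one-hot encoding described in this caption could look roughly like the sketch below. It is an assumption-laden illustration: lowercasing, treating the Spanish opening marks `¿` and `¡` as their own tokens, and the helper names `tokenize`, `build_vocab`, and `one_hot` are choices made here, not details confirmed by the paper.

```python
import re
from typing import Dict, List


def tokenize(sentence: str) -> List[str]:
    """Split into lowercase word tokens; question/exclamation marks become separate tokens."""
    return re.findall(r"[¿?¡!]|\w+", sentence.lower())


def build_vocab(sentences: List[str]) -> Dict[str, int]:
    """Assign each distinct token an index in order of first appearance."""
    vocab: Dict[str, int] = {}
    for sentence in sentences:
        for token in tokenize(sentence):
            vocab.setdefault(token, len(vocab))
    return vocab


def one_hot(token: str, vocab: Dict[str, int]) -> List[int]:
    """Encode a token as a vector with a single 1 at the token's vocabulary index."""
    vector = [0] * len(vocab)
    vector[vocab[token]] = 1
    return vector


# Illustrative two-sentence corpus.
corpus = ["¿Qué tal?", "Estoy bien"]
vocab = build_vocab(corpus)
encoded = [one_hot(token, vocab) for token in tokenize("¿Qué tal?")]
```

One-hot vectors grow with vocabulary size, which is acceptable for a small proof-of-concept vocabulary but usually gives way to learned embeddings on larger corpora.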

**Figure 7.** Training loss and cross-validation loss during training

**Figure 8.** TEO robot signing via LUT from LSE tokens to execution. The use of the presented sequence-to-sequence model is justified from a proof-of-concept point of view. Future prospects may focus on attention mechanisms and augmented RNNs, which will allow neural networks to work with larger sequences of data including text and video, which is especially useful for this topic, since sign language is basically a visu…
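
The lookup-table (LUT) step named in this caption can be sketched as a plain dictionary from LSE tokens to pre-recorded motions. Everything concrete here is hypothetical: the token set, the motion names, and the `plan_motions` helper are placeholders for illustration and are not identifiers from the TEO software.

```python
from typing import Dict, List

# Hypothetical lookup table: LSE token -> name of a pre-recorded robot motion.
SIGN_LUT: Dict[str, str] = {
    "TÚ": "motion_tu",
    "BIEN": "motion_bien",
    "DORMIR": "motion_dormir",
}


def plan_motions(lse_tokens: List[str], lut: Dict[str, str]) -> List[str]:
    """Map each LSE token to its motion, failing loudly on tokens the table lacks."""
    missing = [token for token in lse_tokens if token.upper() not in lut]
    if missing:
        raise KeyError(f"No motion recorded for tokens: {missing}")
    return [lut[token.upper()] for token in lse_tokens]


plan = plan_motions(["Tú", "Bien"], SIGN_LUT)
print(plan)
```

Raising on unknown tokens is one possible design choice; a deployed system might instead fall back to fingerspelling or skip the token.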

Original abstract

This paper presents a study on natural language to sign language translation with human-robot interaction application purposes. By means of the presented methodology, the humanoid robot TEO is expected to represent Spanish sign language automatically by converting text into movements, thanks to the performance of neural networks. Natural language to sign language translation presents several challenges to developers, such as the discordance between the length of input and output data and the use of non-manual markers. Therefore, neural networks and, consequently, sequence-to-sequence models, are selected as a data-driven system to avoid traditional expert system approaches or temporal dependencies limitations that lead to limited or too complex translation systems. To achieve these objectives, it is necessary to find a way to perform human skeleton acquisition in order to collect the signing input data. OpenPose and skeletonRetriever are proposed for this purpose and a 3D sensor specification study is developed to select the best acquisition hardware.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This is a methodology proposal for text-to-sign-language on a robot using standard seq2seq, with no implementation, data, or results.

The main thing to know is that this paper lays out a plan to translate Spanish text into movements for the TEO humanoid robot using sequence-to-sequence neural networks, but it contains no code, dataset, trained model, or test results of any kind. Everything is described as future work or an expectation.

They correctly flag the real difficulties in sign language, such as handling different sequence lengths between text and signs and capturing non-manual markers like facial expressions. Choosing a data-driven neural approach over hand-crafted rules is a reasonable response to those issues. The suggestion to use OpenPose plus a 3D sensor for skeleton data collection is also a practical starting point that draws on existing tools.

Nothing in the method is new. Seq2seq models come straight from machine translation literature, and the paper applies them to this domain without adding algorithms or derivations. The central weakness is the complete absence of evidence. The plan depends on collecting training data that is detailed enough for non-manual elements and variable lengths, yet the paper shows no pilot capture, no quality checks, and no indication that OpenPose will handle the nuances reliably. This leaves the feasibility untested.

The work would mainly interest robotics researchers brainstorming accessibility applications who want a high-level roadmap of the problem. It does not supply methods or findings that others could use or verify. I would not bring it to a reading group or cite it. It does not look ready for peer review because the contribution is an unexecuted outline rather than demonstrated progress.

Referee Report

2 major / 0 minor

Summary. The manuscript proposes a methodology for natural language to Spanish Sign Language translation aimed at human-robot interaction with the humanoid robot TEO. It identifies challenges of input-output length mismatch and non-manual markers, selects sequence-to-sequence neural networks to address them, and outlines data acquisition via OpenPose and skeletonRetriever paired with a 3D sensor after a hardware specification study. No implemented models, datasets, or performance results are presented; the work is framed as a planned pipeline whose success is expected due to neural network capabilities.

Significance. If the outlined pipeline can be realized and validated, the work would contribute to accessible robotics by automating sign language generation, addressing a practical gap in HRI for deaf users. The manuscript receives credit for explicitly framing the problem, selecting a data-driven seq2seq approach over expert systems, and identifying the data-capture step as prerequisite; however, the absence of any concrete implementation or preliminary evidence keeps the significance prospective rather than demonstrated.

major comments (2)

[Abstract] The central claim that 'the humanoid robot TEO is expected to represent Spanish sign language automatically by converting text into movements, thanks to the performance of neural networks' is presented without any model architecture details, training procedure, or preliminary results, leaving the handling of length discordance and non-manual markers as an untested expectation rather than a substantiated plan.
[Abstract] The weakest assumption, that OpenPose and skeletonRetriever together with a suitable 3D sensor will yield training data of sufficient quality and quantity, is stated without analysis of known limitations of these tools (e.g., reduced accuracy on facial expressions or fine hand articulations required for non-manual markers), which directly affects the feasibility of the seq2seq training objective.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our methodology proposal for natural language to Spanish Sign Language translation. The manuscript outlines a planned pipeline using seq2seq models and identifies data acquisition needs; we address the abstract concerns below and will revise accordingly.

Referee: [Abstract] The central claim that 'the humanoid robot TEO is expected to represent Spanish sign language automatically by converting text into movements, thanks to the performance of neural networks' is presented without any model architecture details, training procedure, or preliminary results, leaving the handling of length discordance and non-manual markers as an untested expectation rather than a substantiated plan.

Authors: We agree the abstract phrasing implies more than the work delivers. This manuscript presents a proposed methodology and rationale for selecting seq2seq models to handle the identified challenges, without implementation or results. We will revise the abstract to explicitly frame the work as an outline of the planned approach rather than a validated system. Revision: yes.

Referee: [Abstract] The weakest assumption, that OpenPose and skeletonRetriever together with a suitable 3D sensor will yield training data of sufficient quality and quantity, is stated without analysis of known limitations of these tools (e.g., reduced accuracy on facial expressions or fine hand articulations required for non-manual markers), which directly affects the feasibility of the seq2seq training objective.

Authors: The observation is correct; the manuscript does not analyze these tool limitations. We will add discussion of OpenPose's known constraints on facial expressions and fine hand movements, their relevance to non-manual markers, and how the hardware study informs data collection feasibility. Revision: yes.

Circularity Check

0 steps flagged

No significant circularity

The manuscript is a methodology proposal that describes a planned text-to-sign-language pipeline using standard seq2seq neural networks, OpenPose for skeleton data, and a 3D sensor study. No equations, fitted parameters, derived predictions, or self-citations appear in the provided text. The central claim is an untested expectation rather than a derivation that reduces to its own inputs by construction; the approach rests on conventional neural-network assumptions and is therefore self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the domain assumption that data-driven seq2seq models will overcome limitations of expert systems for sign language without any new parameters or entities introduced.

axioms (1)

domain assumption: Sequence-to-sequence neural networks can effectively manage discordance in input/output lengths and non-manual markers when trained on skeleton data from sign language.
Explicitly stated in the abstract as the rationale for selecting neural networks over traditional approaches.

pith-pipeline@v0.9.0 · 5719 in / 1235 out tokens · 24294 ms · 2026-05-25T00:26:44.525126+00:00 · methodology

