Speech-Gesture GAN: Gesture Generation for Robots and Embodied Agents

Carson Yu Liu; Gelareh Mohammadi; Wafa Johal; Yang Song

arxiv: 2309.09346 · v1 · submitted 2023-09-17 · 💻 cs.AI · cs.RO

Pith reviewed 2026-05-24 06:13 UTC · model grok-4.3

classification 💻 cs.AI cs.RO

keywords gesture generation · conditional GAN · co-speech gestures · speech to gesture · embodied agents · joint angles · nonverbal behavior · robot interaction

The pith

A conditional GAN generates sequences of joint angles from speech text and audio for embodied agents.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to establish that a conditional GAN can produce co-speech gesture sequences by learning mappings from both the meaning and sound properties of speech. This capability would let robots and virtual agents add natural nonverbal signals to their interactions without manual keyframing. The network is trained on paired speech and motion data from one speaker, then tested with both numerical metrics and human judgments. If the mapping holds, agents gain a practical way to synchronize body movement with spoken content in real time. The work focuses on showing this pipeline works end-to-end for the given dataset.

Core claim

Based on a conditional Generative Adversarial Network, the neural network model learns the relationships between co-speech gestures and both semantic and acoustic features of the speech input. This lets it generate joint angle sequences from speech text and audio utterances, and objective and subjective evaluations on a public dataset support its effectiveness.

What carries the argument

conditional Generative Adversarial Network (GAN) that maps speech text and audio to sequences of joint angles

If this is right

Embodied agents can generate co-speech gestures automatically from ordinary speech input.
Both semantic content and acoustic properties of speech are used to shape the output movements.
The same framework applies to virtual agents and physical robots.
Objective measures and human subjective ratings both support the quality of the generated sequences.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Retraining on data from additional speakers would likely be needed before the system could match individual gesture styles.
Coupling the model with live text-to-speech output could support fully autonomous spoken interaction.
Adaptation layers may be required when the target agent has different joint limits or body proportions.
The approach could be extended to generate gestures that respond to dialogue context beyond the current utterance.

Load-bearing premise

A dataset captured from one male native English speaker contains enough variety to train a model that generalizes to other speakers and different embodied agents.

What would settle it

Apply the trained model to speech recordings from a different speaker or language and observe whether human raters judge the resulting gestures as significantly less natural than those produced on the original dataset.

Figures

Figures reproduced from arXiv: 2309.09346 by Carson Yu Liu, Gelareh Mohammadi, Wafa Johal, Yang Song.

**Figure 2.** Gesture Generator. First, we concatenate the text embedding, MFCCs, and random noise into one long vector, then pass it through a two-layer bidirectional GRU (gated recurrent unit) with 0.2 dropout. Next, the vector passes through a linear layer with a tanh activation function to reduce the dimensionality of the features. In order to ensure the continuity of generated gestures, we used the fe…

**Figure 3.** Discriminator.

$$L_G^{WGAN} = -\frac{1}{N}\sum_{i=1}^{n} D(s_a, s_t, \hat{g}_i) \qquad (5)$$

$$L_D = \frac{1}{N}\sum_{i=1}^{n} D(s_a, s_t, \hat{g}_i) - \frac{1}{N}\sum_{i=1}^{n} D(s_a, s_t, g_i) \qquad (6)$$

where $s_a$ and $s_t$ represent the speech audio and text features, respectively, $n$ is the total duration of the gesture sequence, and $g_i$ and $\hat{g}_i$ are the $i$th original gesture and $i$th generated gesture, respectively. Using MSE (mean squared error) in Equation 3 and continuity loss in Equa…
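
The two critic-based losses quoted in the Figure 3 caption (Equations 5 and 6) can be sketched in a few lines. This is a minimal illustration and not the authors' code: the function names are hypothetical, the critic scores are made-up numbers standing in for D(s_a, s_t, g), and the sketch assumes the normaliser N equals the number of summed frames n.

```python
from statistics import fmean


def generator_wgan_loss(fake_scores):
    """Equation 5: negative mean critic score on generated gestures."""
    return -fmean(fake_scores)


def critic_loss(fake_scores, real_scores):
    """Equation 6: mean critic score on generated gestures minus
    mean critic score on original gestures."""
    return fmean(fake_scores) - fmean(real_scores)


# Hypothetical critic outputs for one four-frame sequence (assumed values).
fake = [0.2, -0.1, 0.4, 0.3]   # scores for generated gestures
real = [0.9, 1.1, 0.8, 1.2]    # scores for original gestures

g_loss = generator_wgan_loss(fake)
d_loss = critic_loss(fake, real)
```

The generator lowers its loss by raising the critic's score on generated gestures, while the critic lowers its loss by scoring original gestures above generated ones, so the two objectives pull in opposite directions on the same term.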

**Figure 4.** Qualitative results.

TABLE IV. Objective evaluation of the proposed model against the state of the art. For all metrics, closer to the ground truth is better. Acc. = acceleration.

| Model | Acc. (cm/s²) | Jerk (cm/s³) | RMSE (cm) |
|---|---|---|---|
| Gesticulator | 63.8 ± 8.3 | 1332 ± 192 | 13.0 ± 14.7 |
| Proposed Model | 94.48 ± 19.64 | 2187.76 ± 611.97 | 4.21 ± 4.54 |
| Ground Truth | 144.7 ± 36.6 | 2322 ± 538 | 0 |

B. Subjective Evaluation. Our user study was delivered via an anon…
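
The acceleration, jerk, and RMSE columns of Table IV can be illustrated with finite differences over a joint position track. The sketch below is an assumption-laden example, not the paper's evaluation script: the helper names, the 20 fps frame rate, the one-dimensional track, and the use of mean absolute values are all hypothetical choices, since the excerpt does not state how the metrics were aggregated.

```python
import math


def finite_difference(samples, fps):
    """First time derivative of a uniformly sampled 1-D signal."""
    return [(b - a) * fps for a, b in zip(samples, samples[1:])]


def mean_abs_derivative(positions_cm, fps, order):
    """Mean absolute value of the order-th derivative of a position track.
    order=2 gives acceleration (cm/s^2), order=3 gives jerk (cm/s^3)."""
    signal = list(positions_cm)
    for _ in range(order):
        signal = finite_difference(signal, fps)
    return sum(abs(v) for v in signal) / len(signal)


def rmse(predicted_cm, reference_cm):
    """Root mean squared error between two equally long position tracks."""
    squared = [(p - r) ** 2 for p, r in zip(predicted_cm, reference_cm)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical joint coordinate in cm, sampled at an assumed 20 fps.
fps = 20
track = [0.0, 1.0, 4.0, 9.0, 16.0]   # position grows as frame index squared

acceleration = mean_abs_derivative(track, fps, order=2)
jerk = mean_abs_derivative(track, fps, order=3)
error = rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
```

Because the table scores each model by closeness to the ground-truth row, a generated track would be compared by computing the same statistics on both the generated and the recorded motion.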

**Figure 5.** Results of the user study. A two-tailed t-test was used to determine if there was a statistically significant difference in the scores of the GT and proposed groups. Although the mean rating scores of the proposed model are lower than the ground truth, especially in semantic consistency, there was no statistically significant difference a…

Footnote 1: Sample from proposed group and sample from GT group. Footnote 2: HC No: HC220411.
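
A two-tailed t-test of the kind mentioned in the Figure 5 caption can be sketched with the standard library alone. The excerpt does not say which t-test variant was used, so this example assumes Welch's unequal-variance form; the function names and the rating values are hypothetical, and the p-value comes from numerically integrating the t density rather than from a statistics package.

```python
import math
from statistics import fmean, variance


def welch_t(sample_a, sample_b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    var_a = variance(sample_a) / len(sample_a)
    var_b = variance(sample_b) / len(sample_b)
    t = (fmean(sample_a) - fmean(sample_b)) / math.sqrt(var_a + var_b)
    df = (var_a + var_b) ** 2 / (
        var_a ** 2 / (len(sample_a) - 1) + var_b ** 2 / (len(sample_b) - 1)
    )
    return t, df


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))


def two_tailed_p(t, df, steps=2000):
    """Two-tailed p-value via Simpson's rule on [0, |t|] (steps must be even)."""
    upper = abs(t)
    if upper == 0.0:
        return 1.0
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_pdf(k * h, df)
    central_mass = 2 * total * h / 3   # probability mass inside [-|t|, |t|]
    return max(0.0, 1.0 - central_mass)


# Hypothetical 5-point ratings for ground-truth and generated clips.
gt_ratings = [5.0, 4.0, 4.0, 5.0, 3.0, 4.0]
proposed_ratings = [4.0, 4.0, 3.0, 5.0, 3.0, 4.0]

t_stat, dof = welch_t(gt_ratings, proposed_ratings)
p_value = two_tailed_p(t_stat, dof)
```

With these made-up ratings the ground-truth mean is higher, yet the p-value stays above 0.05, which mirrors the pattern the caption describes: a lower mean for the proposed model without a statistically significant difference.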

Original abstract

Embodied agents, in the form of virtual agents or social robots, are rapidly becoming more widespread. In human-human interactions, humans use nonverbal behaviours to convey their attitudes, feelings, and intentions. Therefore, this capability is also required for embodied agents in order to enhance the quality and effectiveness of their interactions with humans. In this paper, we propose a novel framework that can generate sequences of joint angles from the speech text and speech audio utterances. Based on a conditional Generative Adversarial Network (GAN), our proposed neural network model learns the relationships between the co-speech gestures and both semantic and acoustic features from the speech input. In order to train our neural network model, we employ a public dataset containing co-speech gestures with corresponding speech audio utterances, which were captured from a single male native English speaker. The results from both objective and subjective evaluations demonstrate the efficacy of our gesture-generation framework for Robots and Embodied Agents.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Routine conditional GAN application to single-speaker co-speech gesture data with no demonstrated generalization beyond that narrow source.

The paper presents a conditional GAN that takes speech text and audio and outputs sequences of joint angles for gestures. It trains and evaluates on a public dataset from one male native English speaker, then reports objective metrics plus subjective judgments to claim the framework works for robots and embodied agents. That is the core of it.

The work applies an established generative model to this domain and includes both types of evaluation, which is better than papers that stop at qualitative examples. The dataset choice is public and the setup is straightforward, so someone replicating the method would have a clear starting point.

The main limitation is the data. Everything stays within held-out portions of that single-speaker corpus, with no cross-speaker, cross-gender, or cross-embodiment tests shown. Gestures are highly individual, so the learned mapping is unlikely to transfer without additional evidence, and the abstract's broader claims about robots in general rest on that untested assumption. Technical details such as exact architecture, loss terms, and training procedure are not visible in the provided abstract, which makes it hard to judge whether any implementation choices go beyond standard practice. The stress-test note correctly flags the single-speaker constraint as the central issue for the generalizability claim.

This paper is mainly for researchers already working on gesture synthesis in human-robot interaction who need another concrete GAN baseline or implementation reference. Readers looking for methodological novelty or results that hold across speakers will not find much here. It deserves peer review because the task is well-defined, the evaluations are attempted, and a referee can check the full implementation details and push on the generalization gap.

Referee Report

2 major / 0 minor

Summary. The paper proposes a conditional GAN framework that generates sequences of joint angles for co-speech gestures from speech text and audio utterances. The model learns mappings between gestures and semantic/acoustic speech features, trained on a public single-speaker dataset (one male native English speaker), with the authors stating that objective and subjective evaluations demonstrate its efficacy for robots and embodied agents.

Significance. If the central claim holds with adequate evidence, the work could offer a practical approach to generating natural nonverbal behaviors for embodied agents, potentially improving human-agent interaction quality. However, the narrow single-speaker training and evaluation setup provides limited support for the broad applicability claimed.

major comments (2)

[Abstract] The claim that the framework demonstrates 'efficacy... for Robots and Embodied Agents' is load-bearing for the paper's contribution, yet all described training and both objective/subjective evaluations occur exclusively on held-out data from a single male native English speaker dataset, with no cross-speaker, cross-gender, or cross-embodiment transfer experiments reported. This directly undermines the generalizability required by the central claim.
[Abstract] The description states that the conditional GAN 'learns the relationships between the co-speech gestures and both semantic and acoustic features' and that 'evaluations demonstrate the efficacy,' but supplies no architecture details, loss functions, input feature extraction methods, quantitative metrics, baselines, or error analysis. These omissions make the soundness of the empirical support impossible to assess from the provided text.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. We respond to each major comment point by point below.

Referee: [Abstract] The claim that the framework demonstrates 'efficacy... for Robots and Embodied Agents' is load-bearing for the paper's contribution, yet all described training and both objective/subjective evaluations occur exclusively on held-out data from a single male native English speaker dataset, with no cross-speaker, cross-gender, or cross-embodiment transfer experiments reported. This directly undermines the generalizability required by the central claim.

Authors: We agree that all training and evaluation in the manuscript are performed on held-out data from a single male native English speaker, as explicitly noted in the abstract and methods. The public dataset used is limited to this speaker, and no cross-speaker, cross-gender, or cross-embodiment experiments are reported. While the conditional GAN framework is designed to learn general mappings from semantic and acoustic speech features to gesture sequences, the current evidence supports efficacy only within this narrow setting. To address the concern, we will revise the abstract to more precisely qualify the scope of the claimed efficacy and explicitly note the single-speaker limitation, thereby aligning the central claim with the reported evidence.

revision: partial

Referee: [Abstract] The description states that the conditional GAN 'learns the relationships between the co-speech gestures and both semantic and acoustic features' and that 'evaluations demonstrate the efficacy,' but supplies no architecture details, loss functions, input feature extraction methods, quantitative metrics, baselines, or error analysis. These omissions make the soundness of the empirical support impossible to assess from the provided text.

Authors: The abstract provides a high-level summary of the approach and results, consistent with standard abstract length constraints. Complete details on the model architecture, loss functions, input feature extraction, quantitative metrics, baselines, and error analysis are provided in the full manuscript (Sections 3 and 4). The soundness of the empirical support is therefore assessable from the complete paper rather than the abstract alone. No changes to the abstract are required on this point.

revision: no

Circularity Check

0 steps flagged

No circularity; standard conditional GAN trained and evaluated on held-out single-speaker data

The paper describes a conditional GAN that learns mappings from speech text/audio features to joint-angle gesture sequences, with training and both objective/subjective evaluations performed on held-out portions of one public single-speaker dataset. No equations, parameter-fitting steps, or derivation chain are presented that reduce a claimed prediction to its own inputs by construction. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The approach is a conventional supervised generative model whose outputs are not definitionally equivalent to the training distribution; any limitation on generalizability to other speakers or embodiments is an empirical assumption, not a circularity in the derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities; the central claim rests on the unstated assumption that the single-speaker dataset is adequate for the stated generalization.

Reference graph

Works this paper leans on

33 extracted references · 33 canonical work pages · 3 internal anchors

[1] A. Fisher and L. Griswold, Evaluation of social interaction (ESI). Fort Collins, CO: Three Star Press, 2010.
[2] J. Streeck, “Gesture as communication I: Its coordination with gaze and speech,” Communications Monographs, vol. 60, no. 4, pp. 275–299, 1993.
[3] W. Johal, G. Calvary, and S. Pesty, “Non-verbal signals in HRI: Interference in human perception,” in International Conference on Social Robotics. Springer, 2015, pp. 275–284.
[4] A. S. Dick, S. Goldin-Meadow, U. Hasson, J. I. Skipper, and S. L. Small, “Co-speech gestures influence neural activity in brain regions associated with processing semantic information,” Human Brain Mapping, vol. 30, no. 11, pp. 3509–3526, 2009.
[5] A. Hamacher, N. Bianchi-Berthouze, A. G. Pipe, and K. Eder, “Believing in BERT: Using expressive communication to enhance trust and counteract operational error in physical human-robot interaction,” in 2016 25th IEEE International Symposium on Robot and Human Interactive Communication (RO-MAN). IEEE, 2016, pp. 493–500.
[6] D. McNeill, Hand and mind: What gestures reveal about thought. University of Chicago Press, 1992.
[7] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial networks,” arXiv preprint arXiv:1406.2661, 2014.
[8] P. Bremner, A. G. Pipe, M. Fraser, S. Subramanian, and C. Melhuish, “Beat gesture generation rules for human-robot interaction,” in RO-MAN 2009 - The 18th IEEE International Symposium on Robot and Human Interactive Communication. IEEE, 2009, pp. 1029–1034.
[9] J. Kim, W. H. Kim, W. H. Lee, J.-H. Seo, M. J. Chung, and D.-S. Kwon, “Automated robot speech gesture generation system based on dialog sentence punctuation mark extraction,” in 2012 IEEE/SICE International Symposium on System Integration (SII). IEEE, 2012, pp. 645–647.
[10] H.-H. Kim, Y.-S. Ha, Z. Bien, and K.-H. Park, “Gesture encoding and reproduction for human-robot interaction in text-to-gesture systems,” Industrial Robot: An International Journal, 2012.
[11] I. Mlakar, Z. Kačič, and M. Rojc, “TTS-driven synthetic behaviour-generation model for artificial bodies,” International Journal of Advanced Robotic Systems, vol. 10, no. 10, p. 344, 2013.
[12] Y. Kadono, Y. Takase, and Y. I. Nakano, “Generating iconic gestures based on graphic data analysis and clustering,” in 2016 11th ACM/IEEE International Conference on Human-Robot Interaction (HRI). IEEE, 2016, pp. 447–448.
[13] Y. Yoon, W.-R. Ko, M. Jang, J. Lee, J. Kim, and G. Lee, “Robots learn social skills: End-to-end learning of co-speech gesture generation for humanoid robots,” in 2019 International Conference on Robotics and Automation (ICRA). IEEE, 2019, pp. 4303–4309.
[14] C. T. Ishi, D. Machiyashiki, R. Mikata, and H. Ishiguro, “A speech-driven hand gesture generation method and evaluation in android robots,” IEEE Robotics and Automation Letters, vol. 3, no. 4, pp. 3757–3764, 2018.
[15] D. Hasegawa, N. Kaneko, S. Shirakawa, H. Sakuta, and K. Sumi, “Evaluation of speech-to-gesture generation using bi-directional LSTM network,” in Proceedings of the 18th International Conference on Intelligent Virtual Agents, 2018, pp. 79–86.
[16] T. Kucherenko, D. Hasegawa, G. E. Henter, N. Kaneko, and H. Kjellström, “Analyzing input and output representations for speech-driven gesture generation,” in Proceedings of the 19th ACM International Conference on Intelligent Virtual Agents, 2019, pp. 97–104.
[17] Y. Ferstl, M. Neff, and R. McDonnell, “Multi-objective adversarial gesture generation,” in Motion, Interaction and Games, 2019, pp. 1–10.
[18] T. Kucherenko, P. Jonell, S. van Waveren, G. E. Henter, S. Alexanderson, I. Leite, and H. Kjellström, “Gesticulator: A framework for semantically-aware speech-driven gesture generation,” in Proceedings of the 2020 International Conference on Multimodal Interaction, 2020, pp. 242–250.
[19] Y. Yoon, B. Cha, J.-H. Lee, M. Jang, J. Lee, J. Kim, and G. Lee, “Speech gesture generation from the trimodal context of text, audio, and speaker identity,” ACM Transactions on Graphics (TOG), vol. 39, no. 6, pp. 1–16, 2020.
[20] K. Takeuchi, D. Hasegawa, S. Shirakawa, N. Kaneko, H. Sakuta, and K. Sumi, “Speech-to-gesture generation: A challenge in deep learning approach with bi-directional LSTM,” in Proceedings of the 5th International Conference on Human Agent Interaction, 2017, pp. 365–369.
[21] C.-C. Chiu and S. Marsella, “How to train your avatar: A data driven approach to gesture generation,” in International Workshop on Intelligent Virtual Agents. Springer, 2011, pp. 127–140.
[22] H. Liu, Z. Zhu, N. Iwamoto, Y. Peng, Z. Li, Y. Zhou, E. Bozkurt, and B. Zheng, “BEAT: A large-scale semantic and emotional multi-modal dataset for conversational gestures synthesis,” in Lecture Notes in Computer Science. Springer Nature Switzerland, 2022, pp. 612–630. [Online]. Available: https://doi.org/10.1007/978-3-031-20071-7_36
[23] Y. Ferstl and R. McDonnell, “Investigating the use of recurrent motion modelling for speech gesture generation,” in Proceedings of the 18th International Conference on Intelligent Virtual Agents, 2018, pp. 93–98.
[24] B. Wu, C. Liu, C. T. Ishi, and H. Ishiguro, “Modeling the conditional distribution of co-speech upper body gesture jointly using conditional-GAN and unrolled-GAN,” Electronics, vol. 10, no. 3, p. 228, 2021.
[25] F. S. Grassia, “Practical parameterization of rotations using the exponential map,” Journal of Graphics Tools, vol. 3, no. 3, pp. 29–48, 1998.
[26] C. Ahuja, D. W. Lee, Y. I. Nakano, and L.-P. Morency, “Style transfer for co-speech gesture animation: A multi-speaker conditional-mixture approach,” in European Conference on Computer Vision. Springer, 2020, pp. 248–265.
[27] S. Alexanderson, G. E. Henter, T. Kucherenko, and J. Beskow, “Style-controllable speech-driven gesture synthesis using normalising flows,” in Computer Graphics Forum, vol. 39, no. 2. Wiley Online Library, 2020, pp. 487–496.
[28] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018.
[29] E. Perez, F. Strub, H. De Vries, V. Dumoulin, and A. Courville, “FiLM: Visual reasoning with a general conditioning layer,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32, no. 1, 2018.
[30] K. Vougioukas, S. Petridis, and M. Pantic, “End-to-end speech-driven facial animation with temporal GANs,” arXiv preprint arXiv:1805.09313, 2018.
[31] M. Arjovsky, S. Chintala, and L. Bottou, “Wasserstein generative adversarial networks,” in International Conference on Machine Learning. PMLR, 2017, pp. 214–223.
[32] Y. Liu, G. Mohammadi, Y. Song, and W. Johal, “Speech-based gesture generation for robots and embodied agents: A scoping review,” in Proceedings of the 9th International Conference on Human-Agent Interaction, 2021, pp. 31–38.
[33] P. Wolfert, N. Robinson, and T. Belpaeme, “A review of evaluation practices of gesture generation in embodied conversational agents,” IEEE Transactions on Human-Machine Systems, 2022.
