Efficient Emotion-Aware Iconic Gesture Prediction for Robot Co-Speech

Christian Arzate Cruz; Edwin C. Montiel-Vazquez; Giorgos Giannakakis; Randy Gomez; Stefanos Gkikas; Thomas Kassiotis

arxiv: 2604.11417 · v5 · pith:7TVTQ43Snew · submitted 2026-04-13 · 💻 cs.RO · cs.AI

Efficient Emotion-Aware Iconic Gesture Prediction for Robot Co-Speech

Edwin C. Montiel-Vazquez , Christian Arzate Cruz , Stefanos Gkikas , Thomas Kassiotis , Giorgos Giannakakis , Randy Gomez This is my paper

Pith reviewed 2026-05-21 00:16 UTC · model grok-4.3

classification 💻 cs.RO cs.AI

keywords co-speech gesturesiconic gesturesrobot gesture generationemotion-aware predictiontransformer modelhuman-robot interactiongesture intensity

0 comments

The pith

A lightweight transformer predicts iconic gesture placement and intensity for robots from text and emotion alone.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a compact transformer model designed to generate semantic gestures that robots can use while speaking, based solely on the words being said and the emotional tone. This stands in contrast to most existing systems that produce only rhythmic, non-meaningful movements. A sympathetic reader would care because successful co-speech gestures can make robot interactions more engaging and easier to understand, potentially advancing practical deployment in social robotics. The model is shown to beat a much larger general-purpose AI in specific tasks of deciding when and how strongly to gesture, all while using fewer computational resources.

Core claim

The lightweight transformer derives iconic gesture placement and intensity from text and emotion alone, requiring no audio input at inference time. The model outperforms GPT-4o in both semantic gesture placement classification and intensity regression on the BEAT2 dataset, while remaining computationally compact and suitable for real-time deployment on embodied agents.

What carries the argument

The lightweight transformer that processes text and emotion features to output gesture placement classifications and intensity values.

If this is right

Robots can perform meaningful iconic gestures in real time without processing audio signals.
Gesture generation becomes feasible on devices with limited computing power.
Semantic and emotional cues from text suffice for accurate gesture prediction in controlled datasets.
Interactions with robots may become more natural and informative when gestures align with spoken content and affect.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar models might be adapted to other embodied agents beyond robots, such as virtual avatars.
Future work could test whether combining this with minimal audio features further improves results in noisy environments.
The reliance on a specific dataset suggests the need for validation across different languages and cultural gesture styles.

Load-bearing premise

That the BEAT2 dataset provides enough variety in gestures so that results will hold up in actual robot conversations with people.

What would settle it

A demonstration where the model produces inappropriate or missing gestures when tested on live speech from speakers outside the training data distribution.

Figures

Figures reproduced from arXiv: 2604.11417 by Christian Arzate Cruz, Edwin C. Montiel-Vazquez, Giorgos Giannakakis, Randy Gomez, Stefanos Gkikas, Thomas Kassiotis.

**Figure 2.** Figure 2: High-level overview of the proposed model. [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗

read the original abstract

Co-speech gestures increase engagement and improve speech understanding. Most data-driven robot systems generate rhythmic beat-like motion, yet few integrate semantic emphasis. To address this, we propose a lightweight transformer that derives iconic gesture placement and intensity from text and emotion alone, requiring no audio input at inference time. The model outperforms GPT-4o in both semantic gesture placement classification and intensity regression on the BEAT2 dataset, while remaining computationally compact and suitable for real-time deployment on embodied agents.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Lightweight transformer predicts iconic gestures from text and emotion without audio and claims to beat GPT-4o on BEAT2, but the comparison details are missing.

read the letter

The core contribution is a small transformer that takes text plus emotion labels and outputs placement and intensity for iconic co-speech gestures, with no audio required once the model is trained. This setup targets real-time robot use where full audio pipelines can be heavy or unavailable. The work does a reasonable job highlighting the gap between typical beat-gesture systems and semantic ones, and it positions the model as compact enough for embodied deployment. That practical angle is the clearest strength so far. The outperformance numbers versus GPT-4o on BEAT2 classification and regression are the main empirical hook, and if the training code and splits are reproducible that would count as solid evidence of a usable baseline. The paper also keeps the scope narrow, which avoids overclaiming. The main soft spot is the GPT-4o baseline. The abstract does not describe the prompt or input features given to it, so it is unclear whether the comparison used identical text-and-emotion constraints or gave the larger model extra context such as prosody. Without that information the reported gains cannot be isolated to the proposed architecture. Training procedure, exact data splits, statistical tests, and error breakdowns are also absent from the summary, which leaves the soundness of the central claim low until those sections are checked. The assumption that BEAT2 examples will generalize to varied robot interactions is plausible but untested in the current write-up. This paper is for robotics and HRI researchers who need a lightweight semantic-gesture module rather than a general theory advance. Readers working on real-time embodied agents would find the no-audio design useful even if they end up re-running the experiments with tighter controls. It deserves a serious referee because the idea is concrete and the deployment constraints are realistic, though the review will need to focus on experimental transparency and baseline fairness. I would send it to review with a request for the GPT-4o prompt template and full ablation tables.

Referee Report

2 major / 1 minor

Summary. The paper proposes a lightweight transformer that predicts placement and intensity of iconic (semantic) co-speech gestures for robots using only text and emotion inputs, with no audio required at inference time. It claims this model outperforms GPT-4o on both semantic gesture placement classification and intensity regression tasks when evaluated on the BEAT2 dataset, while remaining compact enough for real-time embodied deployment.

Significance. If the reported outperformance holds under matched input conditions, the work would demonstrate a practical, audio-free approach to semantically and affectively grounded gesture generation that is more efficient than prompting large general-purpose models. This could support real-time robot systems that integrate emotional cues directly from text without relying on prosodic or acoustic features.

major comments (2)

[Abstract / Results] Abstract and results section: The central claim that the proposed model outperforms GPT-4o on BEAT2 for placement classification and intensity regression is load-bearing for the no-audio-at-inference advantage, yet the manuscript provides no description of the GPT-4o prompt template, input feature vector, output parsing procedure, or whether GPT-4o was restricted to the same text+emotion inputs. Without these details the comparison cannot be verified as isolating the contribution of the lightweight transformer.
[Experiments] Experiments section: No information is given on training procedure, data splits, baseline implementations, statistical tests, or error analysis for the BEAT2 evaluation. This absence prevents assessment of whether the reported gains are robust or could be explained by differences in evaluation protocol.

minor comments (1)

[Method] Notation for gesture intensity regression output is introduced without an explicit equation or loss function definition.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback, which highlights important areas for improving the clarity and verifiability of our comparisons and experimental details. We will revise the manuscript to address both major comments fully.

read point-by-point responses

Referee: [Abstract / Results] Abstract and results section: The central claim that the proposed model outperforms GPT-4o on BEAT2 for placement classification and intensity regression is load-bearing for the no-audio-at-inference advantage, yet the manuscript provides no description of the GPT-4o prompt template, input feature vector, output parsing procedure, or whether GPT-4o was restricted to the same text+emotion inputs. Without these details the comparison cannot be verified as isolating the contribution of the lightweight transformer.

Authors: We agree that these implementation details are necessary to substantiate the comparison and isolate the contribution of our lightweight transformer under matched conditions. In the revised manuscript, we will add a dedicated subsection describing the exact GPT-4o prompt template, confirm that inputs were restricted to the same text and emotion labels used by our model (with no audio or additional features), specify the input feature vector construction, and detail the output parsing procedure for extracting placement classifications and intensity regressions. This will enable direct verification of the no-audio-at-inference advantage. revision: yes
Referee: [Experiments] Experiments section: No information is given on training procedure, data splits, baseline implementations, statistical tests, or error analysis for the BEAT2 evaluation. This absence prevents assessment of whether the reported gains are robust or could be explained by differences in evaluation protocol.

Authors: We acknowledge that the current manuscript omits key experimental details for brevity. We will expand the Experiments section to include: the full training procedure (hyperparameters, optimizer, loss functions, and epochs); the precise data splits on BEAT2 (e.g., train/validation/test ratios and any speaker-independent partitioning); implementation details for all baselines including GPT-4o; results of statistical tests (such as paired t-tests or Wilcoxon tests for significance of differences in placement accuracy and intensity MSE); and an error analysis with qualitative examples of success and failure cases. These additions will allow readers to assess the robustness of the reported gains. revision: yes

Circularity Check

0 steps flagged

No significant circularity; model trained and evaluated on external dataset

full rationale

The paper trains a lightweight transformer on the external BEAT2 dataset to map text and emotion features to iconic gesture placement and intensity, then reports outperformance versus GPT-4o on held-out test data. No equations, self-citations, or fitted parameters are shown to reduce the central claims to their own inputs by construction. The derivation chain remains independent of the reported results and does not rely on renaming, ansatz smuggling, or load-bearing self-citations.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Only the abstract is available, so the ledger is necessarily incomplete; the work appears to rest on standard supervised learning assumptions for transformers rather than novel axioms or invented entities.

axioms (1)

domain assumption Standard transformer training assumptions including sufficient labeled data in BEAT2 for learning gesture placement and intensity from text embeddings and emotion labels.
Invoked implicitly by the claim of training a model that generalizes from the dataset.

pith-pipeline@v0.9.0 · 5621 in / 1142 out tokens · 42505 ms · 2026-05-21T00:16:22.575874+00:00 · methodology

Review history (3 revisions) →

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

lightweight transformer that derives iconic gesture placement and intensity from text and emotion alone... cross-attention... self-attention... Fourier feature encoding
IndisputableMonolith/Foundation/DimensionForcing.lean alexander_duality_circle_linking unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

outperforms GPT-4o... on BEAT2 dataset... latency of 1.16 ms

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

37 extracted references · 37 canonical work pages · 2 internal anchors

[1]

A comprehensive review of data-driven co-speech gesture generation,

S. Nyatsanga, T. Kucherenko, C. Ahuja, G. E. Henter, and M. Neff, “A comprehensive review of data-driven co-speech gesture generation,” in Computer Graphics Forum, vol. 42, no. 2. Wiley Online Library, 2023, pp. 569–596

work page 2023
[2]

Emage: Towards unified holistic co-speech gesture generation via expressive masked audio gesture modeling,

H. Liu, Z. Zhu, G. Becherini, Y . Peng, M. Su, Y . Zhou, X. Zhe, N. Iwamoto, B. Zheng, and M. J. Black, “Emage: Towards unified holistic co-speech gesture generation via expressive masked audio gesture modeling,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 1144–1154

work page 2024
[3]

Semtalk: Holistic co-speech motion generation with frame-level semantic emphasis,

X. Zhang, J. Li, J. Zhang, Z. Dang, J. Ren, L. Bo, and Z. Tu, “Semtalk: Holistic co-speech motion generation with frame-level semantic emphasis,” inProceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 13 761–13 771

work page 2025
[4]

Gesture modeling and animation based on a probabilistic re-creation of speaker style,

M. Neff, M. Kipp, I. Albrecht, and H.-P. Seidel, “Gesture modeling and animation based on a probabilistic re-creation of speaker style,”ACM Transactions On Graphics (TOG), vol. 27, no. 1, pp. 1–24, 2008

work page 2008
[5]

Gesticulator: A framework for semantically- aware speech-driven gesture generation,

T. Kucherenko, P. Jonell, S. Van Waveren, G. E. Henter, S. Alexandersson, I. Leite, and H. Kjellstr ¨om, “Gesticulator: A framework for semantically- aware speech-driven gesture generation,” inProceedings of the 2020 international conference on multimodal interaction, 2020, pp. 242–250

work page 2020
[6]

Probabilistic human-like gesture synthesis from speech using gru-based wgan,

B. Wu, C. Liu, C. T. Ishi, and H. Ishiguro, “Probabilistic human-like gesture synthesis from speech using gru-based wgan,” inCompanion pub- lication of the 2021 international conference on multimodal interaction, 2021, pp. 194–201

work page 2021
[7]

Impact of personality on generation of co-speech nonverbal behaviors represented by 3d skeleton pose,

R. Ishii, S. Eitoku, and Y . Sato, “Impact of personality on generation of co-speech nonverbal behaviors represented by 3d skeleton pose,” inProceedings of the 13th International Conference on Human-Agent Interaction, 2025, pp. 247–256

work page 2025
[8]

The nature of emotions: Human emotions have deep evolutionary roots, a fact that may explain their complexity and provide tools for clinical practice,

R. Plutchik, “The nature of emotions: Human emotions have deep evolutionary roots, a fact that may explain their complexity and provide tools for clinical practice,”American scientist, vol. 89, no. 4, pp. 344–350, 2001

work page 2001
[9]

Twifly: A data analysis framework for twitter,

P. Chatziadam, A. Dimitriadis, S. Gikas, I. Logothetis, M. Michalodim- itrakis, M. Neratzoulakis, A. Papadakis, V . Kontoulis, N. Siganos, D. Theodoropoulos, G. V ougioukalos, I. Hatzakis, G. Gerakis, N. Pa- padakis, and H. Kondylakis, “Twifly: A data analysis framework for twitter,”Information, vol. 11, no. 5, 2020

work page 2020
[10]

Robots learn social skills: End-to-end learning of co-speech gesture generation for humanoid robots,

Y . Yoon, W.-R. Ko, M. Jang, J. Lee, J. Kim, and G. Lee, “Robots learn social skills: End-to-end learning of co-speech gesture generation for humanoid robots,” in2019 International Conference on Robotics and Automation (ICRA). IEEE, 2019, pp. 4303–4309

work page 2019
[11]

A learning-based co-speech gesture generation system for social robots,

X. Li and C. Dondrup, “A learning-based co-speech gesture generation system for social robots,” inProceedings of the 12th International Conference on Human-Agent Interaction, 2024, pp. 453–455

work page 2024
[12]

Evaluating the effect of co- speech gesture prediction on human–robot interaction,

E. Fern ´andez-Rodicio, J. J. Gamboa-Montero, M. Maroto-G ´omez, ´A. Castro-Gonz ´alez, and M. A. Salichs, “Evaluating the effect of co- speech gesture prediction on human–robot interaction,”International Journal of Human-Computer Studies, p. 103674, 2025

work page 2025
[13]

Co-speech gesture and facial expression generation for non-photorealistic 3d characters,

T. Omine, N. Kawabata, and F. Homma, “Co-speech gesture and facial expression generation for non-photorealistic 3d characters,” in Proceedings of the Special Interest Group on Computer Graphics and Interactive Techniques Conference Posters, 2025, pp. 1–2

work page 2025
[14]

Beat: A large-scale semantic and emotional multi-modal dataset for conversational gestures synthesis,

H. Liu, Z. Zhu, N. Iwamoto, Y . Peng, Z. Li, Y . Zhou, E. Bozkurt, and B. Zheng, “Beat: A large-scale semantic and emotional multi-modal dataset for conversational gestures synthesis,” inEuropean conference on computer vision. Springer, 2022, pp. 612–630

work page 2022
[15]

Sarges: Semantically aligned reliable gesture generation via intent chain,

N. Gao, Y . Bao, D. Weng, J. Zhao, J. Li, Y . Zhou, and P. Wan, “Sarges: Semantically aligned reliable gesture generation via intent chain,” in Proceedings of the International Workshop on Generation and Evaluation of Non-verbal Behaviour for Embodied Agents, 2025, pp. 13–21

work page 2025
[16]

Long short-term memory,

S. Hochreiter, “Long short-term memory,”Neural Computation MIT- Press, 1997

work page 1997
[17]

Attention is all you need,

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,”Advances in neural information processing systems, vol. 30, 2017

work page 2017
[18]

Diffusestylegesture: Stylized audio-driven co-speech gesture generation with diffusion models,

S. Yang, Z. Wu, M. Li, Z. Zhang, L. Hao, W. Bao, M. Cheng, and L. Xiao, “Diffusestylegesture: Stylized audio-driven co-speech gesture generation with diffusion models,”arXiv preprint arXiv:2305.04919, 2023

work page arXiv 2023
[19]

Andorid robot motion generation based on video-recorded human demonstrations,

D.-S. Go, H.-J. Hyung, D.-W. Lee, and H. U. Yoon, “Andorid robot motion generation based on video-recorded human demonstrations,” in2018 27th IEEE International Symposium on Robot and Human Interactive Communication (RO-MAN). IEEE, 2018, pp. 476–478

work page 2018
[20]

Beat gesture generation rules for human-robot interaction,

P. Bremner, A. G. Pipe, M. Fraser, S. Subramanian, and C. Melhuish, “Beat gesture generation rules for human-robot interaction,” inRO-MAN 2009-the 18th IEEE international Symposium on Robot and human interactive communication. IEEE, 2009, pp. 1029–1034

work page 2009
[21]

Bert: Pre-training of deep bidirectional transformers for language understanding,

J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” in Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers), 2019, pp. 4171–4186

work page 2019
[22]

Speech-gesture gan: Gesture generation for robots and embodied agents,

C. Y . Liu, G. Mohammadi, Y . Song, and W. Johal, “Speech-gesture gan: Gesture generation for robots and embodied agents,” in2023 32nd IEEE International Conference on Robot and Human Interactive Communication (RO-MAN). IEEE, 2023, pp. 405–412

work page 2023
[23]

Srg 3: Speech-driven robot gesture generation with gan,

C. Yu and A. Tapus, “Srg 3: Speech-driven robot gesture generation with gan,” in2020 16th International Conference on Control, Automation, Robotics and Vision (ICARCV). IEEE, 2020, pp. 759–766

work page 2020
[24]

Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks

N. Reimers and I. Gurevych, “Sentence-bert: Sentence embeddings using siamese bert-networks,”arXiv preprint arXiv:1908.10084, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1908
[25]

Emo2Vec: Learning Generalized Emotion Representation by Multi-task Training

P. Xu, A. Madotto, C.-S. Wu, J. H. Park, and P. Fung, “Emo2vec: Learning generalized emotion representation by multi-task training,”arXiv preprint arXiv:1809.04505, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[26]

Efficient pain recognition via respiration signals: A single cross-attention transformer multi-window fusion pipeline,

S. Gkikas, I. Kyprakis, and M. Tsiknakis, “Efficient pain recognition via respiration signals: A single cross-attention transformer multi-window fusion pipeline,” inCompanion Proceedings of the 27th International Conference on Multimodal Interaction, ser. ICMI Companion ’25. New York, NY , USA: Association for Computing Machinery, 2025, p. 70–79

work page 2025
[27]

Synthetic thermal and rgb videos for automatic pain assessment utilizing a vision-mlp architecture,

S. Gkikas and M. Tsiknakis, “Synthetic thermal and rgb videos for automatic pain assessment utilizing a vision-mlp architecture,” in2024 12th International Conference on Affective Computing and Intelligent Interaction Workshops and Demos (ACIIW), 2024, pp. 4–12

work page 2024
[28]

A lightweight transformer for pain recognition from brain activity,

S. Gkikas, C. A. Cruz, Y . Fang, L. Cao, M. U. Khan, T. Kassiotis, G. Giannakakis, R. F. Rojas, and R. Gomez, “A lightweight transformer for pain recognition from brain activity,” 2026

work page 2026
[29]

1bt: One-block transformer for eeg-based cognitive workload assessment,

S. Gkikas, C. A. Cruz, T. Kassiotis, G. Giannakakis, R. F. Rojas, and R. Gomez, “1bt: One-block transformer for eeg-based cognitive workload assessment,” 2026

work page 2026
[30]

An Explainable Artificial Intelligence Approach for Detecting Empathy in Textual Communication,

E. C. Montiel-V ´azquez, J. A. Ram ´ırez Uresti, and O. Loyola-Gonz ´alez, “An Explainable Artificial Intelligence Approach for Detecting Empathy in Textual Communication,”Applied Sciences, vol. 12, no. 19, p. 9407, Sep. 2022

work page 2022
[31]

Empatheticexchanges: Toward understanding the cues for empathy in dyadic conversations,

E. C. Montiel-V ´azquez, C. Arzate Cruz, J. A. R. Uresti, and R. Gomez, “Empatheticexchanges: Toward understanding the cues for empathy in dyadic conversations,”IEEE Access, vol. 12, pp. 195 097–195 110, 2024

work page 2024
[32]

GPT-4 technical report

“GPT-4 technical report.” [Online]. Available: http://arxiv.org/abs/2303. 08774

work page
[33]

Meaning and understanding in large language models,

V . Havl´ık, “Meaning and understanding in large language models,” Synthese, vol. 205, no. 1, p. 9, 2024

work page 2024
[34]

Haru: Hardware design of an experimental tabletop robot assistant,

R. Gomez, D. Szapiro, K. Galindo, and K. Nakamura, “Haru: Hardware design of an experimental tabletop robot assistant,” inProceedings of the 2018 ACM/IEEE international conference on human-robot interaction, 2018, pp. 233–240

work page 2018
[35]

A view on edge caching applications,

D. Antonogiorgakis, A. Britzolakis, P. Chatziadam, A. Dimitriadis, S. Gikas, E. Michalodimitrakis, M. Oikonomakis, N. Siganos, E. Tzagkarakis, Y . Nikoloudakis, S. Panagiotakis, E. Pallis, and E. K. Markakis, “A view on edge caching applications,” 2019. [Online]. Available: https://arxiv.org/abs/1907.12359

work page arXiv 2019
[36]

Data augmentation for 3dmm-based arousal-valence prediction for hri,

C. A. Cruz, Y . Sechayk, T. Igarashi, and R. Gomez, “Data augmentation for 3dmm-based arousal-valence prediction for hri,” in2024 33rd IEEE International Conference on Robot and Human Interactive Communica- tion (ROMAN), 2024, pp. 2015–2022

work page 2024
[37]

A visual perceptual perspective on gaze in social robotics,

R. S. Hessels and Y . Fang, “A visual perceptual perspective on gaze in social robotics,”Psychonomic Bulletin & Review, vol. 33, no. 4, p. 131, 2026

work page 2026

[1] [1]

A comprehensive review of data-driven co-speech gesture generation,

S. Nyatsanga, T. Kucherenko, C. Ahuja, G. E. Henter, and M. Neff, “A comprehensive review of data-driven co-speech gesture generation,” in Computer Graphics Forum, vol. 42, no. 2. Wiley Online Library, 2023, pp. 569–596

work page 2023

[2] [2]

Emage: Towards unified holistic co-speech gesture generation via expressive masked audio gesture modeling,

H. Liu, Z. Zhu, G. Becherini, Y . Peng, M. Su, Y . Zhou, X. Zhe, N. Iwamoto, B. Zheng, and M. J. Black, “Emage: Towards unified holistic co-speech gesture generation via expressive masked audio gesture modeling,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 1144–1154

work page 2024

[3] [3]

Semtalk: Holistic co-speech motion generation with frame-level semantic emphasis,

X. Zhang, J. Li, J. Zhang, Z. Dang, J. Ren, L. Bo, and Z. Tu, “Semtalk: Holistic co-speech motion generation with frame-level semantic emphasis,” inProceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 13 761–13 771

work page 2025

[4] [4]

Gesture modeling and animation based on a probabilistic re-creation of speaker style,

M. Neff, M. Kipp, I. Albrecht, and H.-P. Seidel, “Gesture modeling and animation based on a probabilistic re-creation of speaker style,”ACM Transactions On Graphics (TOG), vol. 27, no. 1, pp. 1–24, 2008

work page 2008

[5] [5]

Gesticulator: A framework for semantically- aware speech-driven gesture generation,

T. Kucherenko, P. Jonell, S. Van Waveren, G. E. Henter, S. Alexandersson, I. Leite, and H. Kjellstr ¨om, “Gesticulator: A framework for semantically- aware speech-driven gesture generation,” inProceedings of the 2020 international conference on multimodal interaction, 2020, pp. 242–250

work page 2020

[6] [6]

Probabilistic human-like gesture synthesis from speech using gru-based wgan,

B. Wu, C. Liu, C. T. Ishi, and H. Ishiguro, “Probabilistic human-like gesture synthesis from speech using gru-based wgan,” inCompanion pub- lication of the 2021 international conference on multimodal interaction, 2021, pp. 194–201

work page 2021

[7] [7]

Impact of personality on generation of co-speech nonverbal behaviors represented by 3d skeleton pose,

R. Ishii, S. Eitoku, and Y . Sato, “Impact of personality on generation of co-speech nonverbal behaviors represented by 3d skeleton pose,” inProceedings of the 13th International Conference on Human-Agent Interaction, 2025, pp. 247–256

work page 2025

[8] [8]

The nature of emotions: Human emotions have deep evolutionary roots, a fact that may explain their complexity and provide tools for clinical practice,

R. Plutchik, “The nature of emotions: Human emotions have deep evolutionary roots, a fact that may explain their complexity and provide tools for clinical practice,”American scientist, vol. 89, no. 4, pp. 344–350, 2001

work page 2001

[9] [9]

Twifly: A data analysis framework for twitter,

P. Chatziadam, A. Dimitriadis, S. Gikas, I. Logothetis, M. Michalodim- itrakis, M. Neratzoulakis, A. Papadakis, V . Kontoulis, N. Siganos, D. Theodoropoulos, G. V ougioukalos, I. Hatzakis, G. Gerakis, N. Pa- padakis, and H. Kondylakis, “Twifly: A data analysis framework for twitter,”Information, vol. 11, no. 5, 2020

work page 2020

[10] [10]

Robots learn social skills: End-to-end learning of co-speech gesture generation for humanoid robots,

Y . Yoon, W.-R. Ko, M. Jang, J. Lee, J. Kim, and G. Lee, “Robots learn social skills: End-to-end learning of co-speech gesture generation for humanoid robots,” in2019 International Conference on Robotics and Automation (ICRA). IEEE, 2019, pp. 4303–4309

work page 2019

[11] [11]

A learning-based co-speech gesture generation system for social robots,

X. Li and C. Dondrup, “A learning-based co-speech gesture generation system for social robots,” inProceedings of the 12th International Conference on Human-Agent Interaction, 2024, pp. 453–455

work page 2024

[12] [12]

Evaluating the effect of co- speech gesture prediction on human–robot interaction,

E. Fern ´andez-Rodicio, J. J. Gamboa-Montero, M. Maroto-G ´omez, ´A. Castro-Gonz ´alez, and M. A. Salichs, “Evaluating the effect of co- speech gesture prediction on human–robot interaction,”International Journal of Human-Computer Studies, p. 103674, 2025

work page 2025

[13] [13]

Co-speech gesture and facial expression generation for non-photorealistic 3d characters,

T. Omine, N. Kawabata, and F. Homma, “Co-speech gesture and facial expression generation for non-photorealistic 3d characters,” in Proceedings of the Special Interest Group on Computer Graphics and Interactive Techniques Conference Posters, 2025, pp. 1–2

work page 2025

[14] [14]

Beat: A large-scale semantic and emotional multi-modal dataset for conversational gestures synthesis,

H. Liu, Z. Zhu, N. Iwamoto, Y . Peng, Z. Li, Y . Zhou, E. Bozkurt, and B. Zheng, “Beat: A large-scale semantic and emotional multi-modal dataset for conversational gestures synthesis,” inEuropean conference on computer vision. Springer, 2022, pp. 612–630

work page 2022

[15] [15]

Sarges: Semantically aligned reliable gesture generation via intent chain,

N. Gao, Y . Bao, D. Weng, J. Zhao, J. Li, Y . Zhou, and P. Wan, “Sarges: Semantically aligned reliable gesture generation via intent chain,” in Proceedings of the International Workshop on Generation and Evaluation of Non-verbal Behaviour for Embodied Agents, 2025, pp. 13–21

work page 2025

[16] [16]

Long short-term memory,

S. Hochreiter, “Long short-term memory,”Neural Computation MIT- Press, 1997

work page 1997

[17] [17]

Attention is all you need,

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,”Advances in neural information processing systems, vol. 30, 2017

work page 2017

[18] [18]

Diffusestylegesture: Stylized audio-driven co-speech gesture generation with diffusion models,

S. Yang, Z. Wu, M. Li, Z. Zhang, L. Hao, W. Bao, M. Cheng, and L. Xiao, “Diffusestylegesture: Stylized audio-driven co-speech gesture generation with diffusion models,”arXiv preprint arXiv:2305.04919, 2023

work page arXiv 2023

[19] [19]

Andorid robot motion generation based on video-recorded human demonstrations,

D.-S. Go, H.-J. Hyung, D.-W. Lee, and H. U. Yoon, “Andorid robot motion generation based on video-recorded human demonstrations,” in2018 27th IEEE International Symposium on Robot and Human Interactive Communication (RO-MAN). IEEE, 2018, pp. 476–478

work page 2018

[20] [20]

Beat gesture generation rules for human-robot interaction,

P. Bremner, A. G. Pipe, M. Fraser, S. Subramanian, and C. Melhuish, “Beat gesture generation rules for human-robot interaction,” inRO-MAN 2009-the 18th IEEE international Symposium on Robot and human interactive communication. IEEE, 2009, pp. 1029–1034

work page 2009

[21] [21]

Bert: Pre-training of deep bidirectional transformers for language understanding,

J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” in Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers), 2019, pp. 4171–4186

work page 2019

[22] [22]

Speech-gesture gan: Gesture generation for robots and embodied agents,

C. Y . Liu, G. Mohammadi, Y . Song, and W. Johal, “Speech-gesture gan: Gesture generation for robots and embodied agents,” in2023 32nd IEEE International Conference on Robot and Human Interactive Communication (RO-MAN). IEEE, 2023, pp. 405–412

work page 2023

[23] [23]

Srg 3: Speech-driven robot gesture generation with gan,

C. Yu and A. Tapus, “Srg 3: Speech-driven robot gesture generation with gan,” in2020 16th International Conference on Control, Automation, Robotics and Vision (ICARCV). IEEE, 2020, pp. 759–766

work page 2020

[24] [24]

Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks

N. Reimers and I. Gurevych, “Sentence-bert: Sentence embeddings using siamese bert-networks,”arXiv preprint arXiv:1908.10084, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1908

[25] [25]

Emo2Vec: Learning Generalized Emotion Representation by Multi-task Training

P. Xu, A. Madotto, C.-S. Wu, J. H. Park, and P. Fung, “Emo2vec: Learning generalized emotion representation by multi-task training,”arXiv preprint arXiv:1809.04505, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[26] [26]

Efficient pain recognition via respiration signals: A single cross-attention transformer multi-window fusion pipeline,

S. Gkikas, I. Kyprakis, and M. Tsiknakis, “Efficient pain recognition via respiration signals: A single cross-attention transformer multi-window fusion pipeline,” inCompanion Proceedings of the 27th International Conference on Multimodal Interaction, ser. ICMI Companion ’25. New York, NY , USA: Association for Computing Machinery, 2025, p. 70–79

work page 2025

[27] [27]

Synthetic thermal and rgb videos for automatic pain assessment utilizing a vision-mlp architecture,

S. Gkikas and M. Tsiknakis, “Synthetic thermal and rgb videos for automatic pain assessment utilizing a vision-mlp architecture,” in2024 12th International Conference on Affective Computing and Intelligent Interaction Workshops and Demos (ACIIW), 2024, pp. 4–12

work page 2024

[28] [28]

A lightweight transformer for pain recognition from brain activity,

S. Gkikas, C. A. Cruz, Y . Fang, L. Cao, M. U. Khan, T. Kassiotis, G. Giannakakis, R. F. Rojas, and R. Gomez, “A lightweight transformer for pain recognition from brain activity,” 2026

work page 2026

[29] [29]

1bt: One-block transformer for eeg-based cognitive workload assessment,

S. Gkikas, C. A. Cruz, T. Kassiotis, G. Giannakakis, R. F. Rojas, and R. Gomez, “1bt: One-block transformer for eeg-based cognitive workload assessment,” 2026

work page 2026

[30] [30]

An Explainable Artificial Intelligence Approach for Detecting Empathy in Textual Communication,

E. C. Montiel-V ´azquez, J. A. Ram ´ırez Uresti, and O. Loyola-Gonz ´alez, “An Explainable Artificial Intelligence Approach for Detecting Empathy in Textual Communication,”Applied Sciences, vol. 12, no. 19, p. 9407, Sep. 2022

work page 2022

[31] [31]

Empatheticexchanges: Toward understanding the cues for empathy in dyadic conversations,

E. C. Montiel-V ´azquez, C. Arzate Cruz, J. A. R. Uresti, and R. Gomez, “Empatheticexchanges: Toward understanding the cues for empathy in dyadic conversations,”IEEE Access, vol. 12, pp. 195 097–195 110, 2024

work page 2024

[32] [32]

GPT-4 technical report

“GPT-4 technical report.” [Online]. Available: http://arxiv.org/abs/2303. 08774

work page

[33] [33]

Meaning and understanding in large language models,

V . Havl´ık, “Meaning and understanding in large language models,” Synthese, vol. 205, no. 1, p. 9, 2024

work page 2024

[34] [34]

Haru: Hardware design of an experimental tabletop robot assistant,

R. Gomez, D. Szapiro, K. Galindo, and K. Nakamura, “Haru: Hardware design of an experimental tabletop robot assistant,” inProceedings of the 2018 ACM/IEEE international conference on human-robot interaction, 2018, pp. 233–240

work page 2018

[35] [35]

A view on edge caching applications,

D. Antonogiorgakis, A. Britzolakis, P. Chatziadam, A. Dimitriadis, S. Gikas, E. Michalodimitrakis, M. Oikonomakis, N. Siganos, E. Tzagkarakis, Y . Nikoloudakis, S. Panagiotakis, E. Pallis, and E. K. Markakis, “A view on edge caching applications,” 2019. [Online]. Available: https://arxiv.org/abs/1907.12359

work page arXiv 2019

[36] [36]

Data augmentation for 3dmm-based arousal-valence prediction for hri,

C. A. Cruz, Y . Sechayk, T. Igarashi, and R. Gomez, “Data augmentation for 3dmm-based arousal-valence prediction for hri,” in2024 33rd IEEE International Conference on Robot and Human Interactive Communica- tion (ROMAN), 2024, pp. 2015–2022

work page 2024

[37] [37]

A visual perceptual perspective on gaze in social robotics,

R. S. Hessels and Y . Fang, “A visual perceptual perspective on gaze in social robotics,”Psychonomic Bulletin & Review, vol. 33, no. 4, p. 131, 2026

work page 2026