
arXiv: 2511.01233 · v4 · submitted 2025-11-03 · 💻 cs.CV · cs.GR · cs.HC

Towards Reliable Human Evaluations in Gesture Generation: Insights from a Community-Driven State-of-the-Art Benchmark

Pith reviewed 2026-05-18 01:36 UTC · model grok-4.3

classification 💻 cs.CV · cs.GR · cs.HC
keywords: gesture generation, human evaluation, speech-driven gestures, BEAT2 dataset, motion realism, speech-gesture alignment, benchmarking, crowdsourced evaluation

The pith

Standardized human evaluations show motion realism has saturated for gesture generation models on the BEAT2 dataset while speech alignment claims fail to hold.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines how human evaluations of speech-driven 3D gesture generators have lacked consistent methods, making it hard to compare approaches or identify real progress. To fix this, the authors create a detailed protocol for the popular BEAT2 motion-capture dataset and use it to run large crowdsourced tests on six recent models trained by their original teams. The tests separate two qualities: how natural the movements look on their own, and how well they match the spoken words. Results indicate that realism scores no longer distinguish newer models from older ones, and alignment scores are lower than earlier studies suggested even for models built specifically for that goal. The work releases rendered videos and human votes so others can test without rebuilding models, and it argues that future benchmarking needs these two qualities measured apart.

Core claim

Applying the new protocol across six author-trained models on BEAT2 reveals that motion realism has become saturated, with older models matching recent ones, while prior reports of strong speech-gesture alignment do not survive rigorous pairwise testing; therefore accurate progress requires separate measurement of motion quality and multimodal alignment rather than combined scores.

What carries the argument

The crowdsourced human evaluation protocol that disentangles motion realism from speech-gesture alignment through large-scale pairwise preference votes on rendered video stimuli from the BEAT2 dataset.
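As a rough illustration of how pairwise preference votes can be turned into a ranking, the sketch below fits a Bradley-Terry model with a simple iterative update and maps the fitted strengths onto an Elo-style scale (Figure 4 reports Elo ratings). The `(winner, loser)` vote format, the function names, the toy votes, and the 1000-point offset are assumptions made for illustration. Ties are ignored, and this is not necessarily the fitting procedure the paper uses.

```python
import math
from collections import defaultdict


def fit_bradley_terry(votes, n_iter=500):
    """Fit Bradley-Terry strengths from an iterable of (winner, loser) pairs.

    Uses the classic minorization-maximization update; strengths are
    rescaled after every pass so that they average to 1.
    """
    wins = defaultdict(int)         # total wins per condition
    pair_counts = defaultdict(int)  # comparisons per unordered pair
    items = set()
    for winner, loser in votes:
        if winner == loser:
            continue  # ignore malformed self-comparisons
        wins[winner] += 1
        pair_counts[frozenset((winner, loser))] += 1
        items.update((winner, loser))

    strength = {item: 1.0 for item in items}
    for _ in range(n_iter):
        updated = {}
        for i in items:
            denom = sum(
                pair_counts[frozenset((i, j))] / (strength[i] + strength[j])
                for j in items
                if j != i and frozenset((i, j)) in pair_counts
            )
            updated[i] = wins[i] / denom if denom > 0 else strength[i]
        total = sum(updated.values())
        strength = {i: s * len(items) / total for i, s in updated.items()}
    return strength


def to_elo_scale(strength, offset=1000.0):
    """Map positive strengths to an Elo-style scale (offset is arbitrary).

    Conditions with zero strength (no wins at all) are dropped because
    their log-rating is undefined.
    """
    return {i: offset + 400.0 * math.log10(s) for i, s in strength.items() if s > 0}


# Hypothetical toy votes: condition "A" is preferred over "B" in 70 of 100 trials.
toy_votes = [("A", "B")] * 70 + [("B", "A")] * 30
toy_strength = fit_bradley_terry(toy_votes)
toy_elo = to_elo_scale(toy_strength)
```

On this scale, a rating gap maps back to an expected preference rate through `1 / (1 + 10 ** (-gap / 400))`, so only rating differences carry meaning and the offset is a free choice.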

If this is right

  • Motion realism can no longer serve as a useful benchmark on BEAT2 because older and newer models perform on par.
  • Claims of high speech-gesture alignment from earlier work do not replicate under controlled conditions even for models designed for alignment.
  • Benchmarking must separate motion quality from multimodal alignment to avoid misleading combined scores.
  • The released five hours of synthetic motion and 750+ video stimuli enable new studies without requiring model reimplementation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Future model development could shift focus toward improving alignment while preserving the already high realism baseline.
  • The released preference votes and rendering script create a reusable testbed that other multimodal generation fields might adapt for their own evaluation standards.
  • If the saturation finding generalizes, research resources may move away from pure realism metrics toward timing, semantics, or style control in gestures.

Load-bearing premise

The crowdsourced protocol itself introduces no new biases from the participant pool or platform that would distort the rankings of realism and alignment.

What would settle it

A replication of the same pairwise votes using a different crowdsourcing platform or screened participant group that produces substantially different model rankings or restores high alignment scores for specialized models.
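One way to quantify whether a replication produces "substantially different model rankings" is a rank correlation between the two orderings. The sketch below computes Kendall's tau for two rankings without ties. The model labels and rank values are hypothetical placeholders, not results from the paper, and the choice of Kendall's tau is an assumption about how such a comparison might be run.

```python
from itertools import combinations


def kendall_tau(rank_a, rank_b):
    """Kendall's tau-a between two rankings given as {model: rank} dicts.

    Assumes both dicts cover the same models. Pairs tied in either
    ranking count as neither concordant nor discordant.
    """
    if set(rank_a) != set(rank_b):
        raise ValueError("both rankings must cover the same models")
    models = sorted(rank_a)
    if len(models) < 2:
        raise ValueError("need at least two models to compare rankings")
    concordant = discordant = 0
    for m, n in combinations(models, 2):
        product = (rank_a[m] - rank_a[n]) * (rank_b[m] - rank_b[n])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    n_pairs = len(models) * (len(models) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical rankings (1 = best) from two crowdsourcing platforms.
platform_one = {"model_1": 1, "model_2": 2, "model_3": 3, "model_4": 4}
platform_two = {"model_1": 1, "model_2": 3, "model_3": 2, "model_4": 4}
tau = kendall_tau(platform_one, platform_two)
```

A value near 1 means the replication preserves the ordering, a value near -1 means it reverses it, and values near 0 indicate little agreement. With only six models the statistic is coarse, so it is best read alongside the confidence intervals of the underlying ratings.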

Figures

Figures reproduced from arXiv: 2511.01233 by Christian Theobalt (9), Gustav Eje Henter (1, 5), Hendric Voss (2), Kiran Chhatre (1), Libin Liu (6), M. Hamza Mughal (9), Michael Neff (11), Mihail Tsakov (4), Rachel McDonnell (10), Rajmund Nagy (1), Rishabh Dabral (9), Shaoli Huang (8), Sicheng Yang (7), Stefan Kopp (2), Taras Kucherenko (12), Tenglong Ao (6), Teodor Nikolov (5), Thanh Hoang-Minh (3), Yongkang Cheng (8), Youngwoo Yoon (13), Zeyi Zhang (6).

Affiliations: (1) KTH Royal Institute of Technology, (2) Bielefeld University, (3) University of Science -- VNUHCM, (4) Independent Researcher, (5) Motorica AB, (6) Peking University, (7) Huawei Technologies Ltd., (8) Astribot, (9) Max-Planck Institute for Informatics, SIC, (10) Trinity College Dublin, (11) University of California, Davis, (12) SEED -- Electronic Arts, (13) Electronics and Telecommunications Research Institute (ETRI).

Figure 1: Direct comparisons are exceedingly rare between state ...
Figure 3: Example embodiments used in recent evaluations ...
Figure 4: Results of the motion-realism user study, in the form of Elo ratings for each condition considered and 95% confidence intervals ...
Figure 5: Results of the speech-gesture appropriateness user study.
Figure 6: Questions and response options in the two types of user studies, also showing their schematic layout in the user-study GUI. For ...
Figure 7: A screenshot of the GUI for the user studies, specifically from a motion-realism test with the current screen containing stimulus ...
Figure 8: Frequency of JUICE options chosen for each model dur ...
Figure 10: A video frame showing a gesturing SMPL-X avatar ...
Original abstract

We review human evaluation practices in automatic, speech-driven 3D gesture generation and find a lack of standardisation and frequent use of flawed experimental setups. This leads to a situation where it is impossible to know how different methods compare, or what the state of the art is. In order to address common shortcomings of evaluation design, and to standardise future user studies in gesture-generation works, we introduce a detailed human evaluation protocol for the widely-used BEAT2 motion-capture dataset. Using this protocol, we conduct large-scale crowdsourced evaluation to rank six recent gesture-generation models -- each trained by its original authors -- across two key evaluation dimensions: motion realism and speech-gesture alignment. Our results show that 1) motion realism has become a saturated evaluation measure on the BEAT2 dataset, with older models performing on par with more recent approaches; 2) previous findings of high speech-gesture alignment do not hold up under rigorous evaluation, even for specialised models; and 3) the field must adopt disentangled assessments of motion quality and multimodal alignment for accurate benchmarking in order to make progress. To drive standardisation and enable new evaluation research, we release five hours of synthetic motion from the benchmarked models; over 750 rendered video stimuli from the user studies -- enabling new evaluations without requiring model reimplementation -- alongside our open-source rendering script, and 16,000 pairwise human preference votes collected for our benchmark.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The manuscript reviews human evaluation practices in automatic speech-driven 3D gesture generation, identifying a lack of standardization and frequent use of flawed experimental setups that prevent reliable comparisons or identification of the state of the art. To address these issues, the authors introduce a detailed human evaluation protocol for the BEAT2 motion-capture dataset. They apply this protocol in a large-scale crowdsourced study ranking six recent gesture-generation models (each trained by its original authors) on two dimensions: motion realism and speech-gesture alignment. Results indicate that motion realism has become saturated on BEAT2 (older models perform on par with recent ones), that prior findings of high speech-gesture alignment do not replicate under rigorous evaluation, and that the field should adopt disentangled assessments of motion quality and multimodal alignment. The authors release five hours of synthetic motion, over 750 rendered video stimuli, an open-source rendering script, and 16,000 pairwise human preference votes to support standardization and future research.

Significance. If the protocol proves robust, the work could meaningfully advance benchmarking standards in gesture generation by providing a reproducible protocol and releasing extensive resources (model outputs, video stimuli, rendering code, and a large set of human votes). These releases are a clear strength for reproducibility and enable new evaluations without model reimplementation. The findings on saturation and alignment replication could prompt the community to move beyond saturated or confounded metrics, though their weight depends on the crowdsourcing protocol being documented in more detail.

major comments (2)
  1. §4 (Evaluation Protocol): The manuscript provides limited detail on participant filtering, attention checks, demographic controls, and any calibration against expert or lab-based raters. Because the headline claims of motion realism saturation on BEAT2 and non-replication of prior alignment results rest on the crowdsourced protocol producing unbiased rankings, these aspects require fuller specification to substantiate the conclusions.
  2. §5 (Results): The evidence for saturation (older models performing on par with recent ones) and the alignment findings should include explicit statistical tests, p-values, effect sizes, or confidence intervals in the relevant tables or figures. Without these, it is difficult to assess whether the observed parity or differences are statistically meaningful or merely due to variance in the crowdsourced data.
minor comments (3)
  1. Figure captions should explicitly describe what error bars represent (e.g., standard error or 95% CI) and clarify the exact comparison being shown in each panel.
  2. Notation for the two evaluation dimensions (motion realism vs. speech-gesture alignment) should be used consistently throughout the text and figures to avoid ambiguity.
  3. [Related Work] The related-work section would benefit from citing any very recent (post-2023) gesture-generation papers that also discuss evaluation practices.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment below and have revised the manuscript to incorporate additional details and analyses where feasible, thereby strengthening the transparency and statistical rigor of our work.

point-by-point responses
  1. Referee: §4 (Evaluation Protocol): The manuscript provides limited detail on participant filtering, attention checks, demographic controls, and any calibration against expert or lab-based raters. Because the headline claims of motion realism saturation on BEAT2 and non-replication of prior alignment results rest on the crowdsourced protocol producing unbiased rankings, these aspects require fuller specification to substantiate the conclusions.

    Authors: We appreciate the referee's emphasis on rigorous documentation of the crowdsourcing procedure. Section 4 of the manuscript already describes the use of attention checks, basic participant filtering based on response quality, and collection of demographic information as part of the protocol for the BEAT2 dataset. To address this comment directly, we will expand the section with more granular specifications of the filtering thresholds, attention check design, and demographic breakdowns. Regarding calibration against expert or lab-based raters, the study did not include a direct comparison; we followed common practices for large-scale crowdsourced evaluations in generative modeling. In the revision we will add an explicit discussion of this design choice, its alignment with prior literature, and any associated limitations. revision: partial

  2. Referee: §5 (Results): The evidence for saturation (older models performing on par with recent ones) and the alignment findings should include explicit statistical tests, p-values, effect sizes, or confidence intervals in the relevant tables or figures. Without these, it is difficult to assess whether the observed parity or differences are statistically meaningful or merely due to variance in the crowdsourced data.

    Authors: We agree that explicit statistical support is important for interpreting the saturation and alignment results. The manuscript currently presents preference rankings and percentages from the 16,000 votes. In the revised version we will augment the relevant tables and figures in Section 5 with appropriate non-parametric statistical tests (e.g., Friedman test followed by post-hoc Wilcoxon signed-rank tests with correction), p-values, effect sizes (rank-biserial correlation), and 95% confidence intervals for the key comparisons. These additions will be computed from the existing preference data and will clarify the statistical meaningfulness of the observed model parity in motion realism and the alignment findings. revision: yes
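To make the kind of statistical support discussed above concrete, the sketch below tests whether one model is preferred over another more often than chance in a single pairwise comparison, then adjusts the p-values across several comparisons. It deliberately swaps in a simpler technique than the one named in the rebuttal: an exact two-sided binomial (sign) test with a Holm-Bonferroni correction rather than Friedman and Wilcoxon tests. The vote counts are made-up placeholders, and the function names are hypothetical.

```python
from math import comb


def binomial_two_sided_p(wins, n):
    """Exact two-sided p-value for H0: each vote is a fair coin flip (p = 0.5)."""
    if not 0 <= wins <= n or n == 0:
        raise ValueError("need 0 <= wins <= n and n > 0")
    k = max(wins, n - wins)
    # By symmetry under p = 0.5, double the upper tail and cap at 1.
    upper_tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * upper_tail)


def holm_adjust(p_values):
    """Holm-Bonferroni step-down adjustment; returns p-values in input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, candidate)  # keep adjusted values monotone
        adjusted[idx] = running_max
    return adjusted


# Placeholder counts: (votes preferring the first model, total votes) per pair.
pair_votes = {
    ("model_1", "model_2"): (130, 250),
    ("model_1", "model_3"): (160, 250),
    ("model_2", "model_3"): (150, 250),
}
raw_p = [binomial_two_sided_p(w, n) for w, n in pair_votes.values()]
adj_p = dict(zip(pair_votes, holm_adjust(raw_p)))
```

A non-significant result here does not by itself show that two models are equivalent. Supporting a saturation claim more directly would call for confidence intervals or an equivalence test with a stated margin.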

Circularity Check

0 steps flagged

New crowdsourced human preference data yields independent benchmark results

full rationale

The paper reviews prior evaluation practices, introduces a detailed protocol for the BEAT2 dataset, and reports results from a fresh large-scale crowdsourced study collecting 16,000 pairwise votes on motion realism and speech-gesture alignment for six models. The headline claims (saturation of realism, failure of prior alignment findings to replicate, need for disentangled metrics) are direct empirical outcomes of this new data collection and protocol application, not reductions of fitted parameters, self-definitions, or self-citation chains. The work is self-contained against external benchmarks via released stimuli and votes; no load-bearing step equates to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper's claims rest on the validity of its new evaluation protocol and the representativeness of the crowdsourced data.

axioms (1)
  • domain assumption: Human judgments in crowdsourced pairwise comparisons reliably reflect perceived motion realism and speech-gesture alignment.
    This underpins the entire evaluation protocol and results.



Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. ViBES: A Conversational Agent with Behaviorally-Intelligent 3D Virtual Body

    cs.CV 2025-12 unverdicted novelty 7.0

    ViBES introduces a speech-language-behavior model using modality-specific transformer experts that jointly generates dialogue and 3D body actions, showing gains over separate co-speech and text-to-motion baselines on ...

  2. Reality Check: How Avatar and Face Representation Affect the Perceptual Evaluation of Synthesized Gestures

    cs.GR 2026-05 unverdicted novelty 6.0

    Avatar and face representation systematically shift perceptual judgments of synthesized co-speech gestures.

  3. Reality Check: How Avatar and Face Representation Affect the Perceptual Evaluation of Synthesized Gestures

    cs.GR 2026-05 unverdicted novelty 5.0

    Avatar appearance and facial presentation systematically bias perceptual judgments of synthesized co-speech gestures.

Reference graph

Works this paper leans on

93 extracted references · 93 canonical work pages · cited by 2 Pith papers

  1. [1]

    Generative AI for character animation: A comprehensive survey of tech- niques, applications, and future directions.arXiv preprint arXiv:2504.19056, 2025

    Mohammad Mahdi Abootorabi, Omid Ghahroodi, Par- dis Sadat Zahraei, Hossein Behzadasl, Alireza Mirrokni, Mobina Salimipanah, Arash Rasouli, Bahar Behzadipour, Sara Azarnoush, Benyamin Maleki, et al. Generative AI for character animation: A comprehensive survey of tech- niques, applications, and future directions.arXiv preprint arXiv:2504.19056, 2025. 1

  2. [2]

    No gestures left behind: Learning rela- tionships between spoken language and freeform gestures

    Chaitanya Ahuja, Dong Won Lee, Ryo Ishii, and Louis- Philippe Morency. No gestures left behind: Learning rela- tionships between spoken language and freeform gestures. InProceedings of the 2020 Conference on Empirical Meth- ods in Natural Language Processing: Findings, pages 1884– 1895, 2020. 3

  3. [3]

    Continual learning for personalized co-speech gesture generation

    Chaitanya Ahuja, Pratik Joshi, Ryo Ishii, and Louis-Philippe Morency. Continual learning for personalized co-speech gesture generation. InProceedings of the IEEE/CVF In- ternational Conference on Computer Vision (ICCV), pages 20893–20903, 2023. 3

  4. [4]

    Style-controllable speech-driven gesture synthesis using normalising flows

    Simon Alexanderson, Gustav Eje Henter, Taras Kucherenko, and Jonas Beskow. Style-controllable speech-driven gesture synthesis using normalising flows. InComputer Graphics Forum, pages 487–496. Wiley Online Library, 2020. 2

  5. [5]

    Listen, denoise, action! audio-driven motion synthesis with diffusion models.ACM Transactions on Graphics (TOG), 42(4):1–20, 2023

    Simon Alexanderson, Rajmund Nagy, Jonas Beskow, and Gustav Eje Henter. Listen, denoise, action! audio-driven motion synthesis with diffusion models.ACM Transactions on Graphics (TOG), 42(4):1–20, 2023. 2, 3, 5, 22

  6. [6]

    GestureDiffu- CLIP: Gesture diffusion model with CLIP latents.ACM Transactions on Graphics (TOG), 42(4):1–18, 2023

    Tenglong Ao, Zeyi Zhang, and Libin Liu. GestureDiffu- CLIP: Gesture diffusion model with CLIP latents.ACM Transactions on Graphics (TOG), 42(4):1–18, 2023. 2, 3, 5

  7. [7]

    Kenneth J. Arrow. A difficulty in the concept of social wel- fare.Journal of Political Economy, 58(4):328–346, 1950. 9

  8. [8]

    Why spiderman is such a good dancer.https : / / web

    Jody Avirgan. Why spiderman is such a good dancer.https : / / web . archive . org / web / 20201112011116/https://www.wnycstudios. org / podcasts / radiolab / articles / 299399 - why-spiderman-such-good-dancer, 2013. 9

  9. [9]

    wav2vec 2.0: A framework for self-supervised learning of speech representations.Advances in neural infor- mation processing systems, 33:12449–12460, 2020

    Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli. wav2vec 2.0: A framework for self-supervised learning of speech representations.Advances in neural infor- mation processing systems, 33:12449–12460, 2020. 21

  10. [10]

    On the adaptive con- trol of the false discovery rate in multiple testing with inde- 10 pendent statistics.J

    Yoav Benjamini and Yosef Hochberg. On the adaptive con- trol of the false discovery rate in multiple testing with inde- 10 pendent statistics.J. Educ. Behav. Stat., 25(1):60–83, 2000. 17

  11. [11]

    Elo uncovered: Robustness and best practices in language model evaluation.Advances in Neural Information Processing Systems, 37:106135–106161, 2024

    Meriem Boubdir, Edward Kim, Beyza Ermis, Sara Hooker, and Marzieh Fadaee. Elo uncovered: Robustness and best practices in language model evaluation.Advances in Neural Information Processing Systems, 37:106135–106161, 2024. 9

  12. [12]

    Ralph Allan Bradley and Milton E. Terry. Rank analysis of incomplete block designs: I. the method of paired compar- isons.Biometrika, 39(3/4):324–345, 1952. 6, 16

  13. [13]

    Towards better user studies in com- puter graphics and vision.Foundations and Trends® in Com- puter Graphics and Vision, 15(3):201–252, 2023

    Zoya Bylinskii, Laura Herman, Aaron Hertzmann, Stefanie Hutka, Yile Zhang, et al. Towards better user studies in com- puter graphics and vision.Foundations and Trends® in Com- puter Graphics and Vision, 15(3):201–252, 2023. 1

  14. [14]

    A V-Flow: Transforming text to audio-visual human-like interactions.arXiv preprint arXiv:2502.13133,

    Aggelina Chatziagapi, Louis-Philippe Morency, Hongyu Gong, Michael Zollh ¨ofer, Dimitris Samaras, and Alexan- der Richard. A V-Flow: Transforming text to audio-visual human-like interactions.arXiv preprint arXiv:2502.13133,

  15. [15]

    Motion-example-controlled co-speech ges- ture generation leveraging large language models

    Bohong Chen, Yumeng Li, Youyi Zheng, Yao-Xiang Ding, and Kun Zhou. Motion-example-controlled co-speech ges- ture generation leveraging large language models. InPro- ceedings of the Special Interest Group on Computer Graph- ics and Interactive Techniques Conference Conference Pa- pers, New York, NY , USA, 2025. Association for Computing Machinery. 3

  16. [16]

    The language of motion: Unifying verbal and non-verbal language of 3d human motion

    Changan Chen, Juze Zhang, Shrinidhi Kowshika Laksh- mikanth, Yusu Fang, Ruizhi Shao, Gordon Wetzstein, Li Fei- Fei, and Ehsan Adeli. The language of motion: Unifying verbal and non-verbal language of 3d human motion. InPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025. 3

  17. [17]

    Diffsheg: A diffusion-based approach for real-time speech-driven holistic 3d expression and ges- ture generation

    Junming Chen, Yunfei Liu, Jianan Wang, Ailing Zeng, Yu Li, and Qifeng Chen. Diffsheg: A diffusion-based approach for real-time speech-driven holistic 3d expression and ges- ture generation. InProceedings of the IEEE/CVF Confer- ence on Computer Vision and Pattern Recognition, 2024. 3, 5

  18. [18]

    Hop: Heterogeneous topology-based mul- timodal entanglement for co-speech gesture generation

    Hongye Cheng, Tianyu Wang, Guangsi Shi, Zexing Zhao, and Yanwei Fu. Hop: Heterogeneous topology-based mul- timodal entanglement for co-speech gesture generation. In Proceedings of the IEEE/CVF Conference on Computer Vi- sion and Pattern Recognition, 2025. 3, 5

  19. [19]

    Siggesture: Gen- eralized co-speech gesture synthesis via semantic injection with large-scale pre-training diffusion models

    Qingrong Cheng, Xu Li, and Xinghui Fu. Siggesture: Gen- eralized co-speech gesture synthesis via semantic injection with large-scale pre-training diffusion models. InSIG- GRAPH Asia 2024 Conference Papers, New York, NY , USA,

  20. [20]

    Association for Computing Machinery. 3

  21. [21]

    HoloGest: Decoupled diffusion and motion priors for generating holisticly expres- sive co-speech gestures

    Yongkang Cheng and Shaoli Huang. HoloGest: Decoupled diffusion and motion priors for generating holisticly expres- sive co-speech gestures. InProceedings of the International Conference on 3D Vision, 2025. 7, 8, 22

  22. [22]

    Black, and Timo Bolkart

    Kiran Chhatre, Radek Dan ˇeˇcek, Nikos Athanasiou, Giorgio Becherini, Christopher Peters, Michael J. Black, and Timo Bolkart. AMUSE: Emotional speech-driven 3D body ani- mation via disentangled latent diffusion. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1942–1953, 2024. 2, 3, 5, 7, 20, 22

  23. [23]

    Gonzalez, and Ion Stoica

    Wei-Lin Chiang, Tim Li, Joseph E. Gonzalez, and Ion Stoica. Chatbot Arena: New models & Elo system up- date.https://lmsys.org/blog/2023- 12- 07- leaderboard/, 2023. Accessed: 2025-05-20. 16

  24. [24]

    Effectively unbiased FID and Inception score and where to find them

    Min Jin Chong and David Forsyth. Effectively unbiased FID and Inception score and where to find them. InProceedings of the IEEE/CVF Conference on Computer Vision and Pat- tern Recognition, pages 6070–6079, 2020. 23

  25. [25]

    Investigating range- equalizing bias in mean opinion score ratings of synthesized speech

    Erica Cooper and Junichi Yamagishi. Investigating range- equalizing bias in mean opinion score ratings of synthesized speech. InProc. Interspeech, pages 1104–1108, 2023. 4

  26. [26]

    Advancing objective evaluation of speech-driven gesture generation for embodied conversational agents.International Journal of Human–Computer Interaction, 0(0):1–17, 2025

    Karlo Crnek, Grega Mo ˇcnik, and Matej Rojc. Advancing objective evaluation of speech-driven gesture generation for embodied conversational agents.International Journal of Human–Computer Interaction, 0(0):1–17, 2025. 2, 10

  27. [27]

    Mofusion: A framework for denoising-diffusion-based motion synthesis

    Rishabh Dabral, Muhammad Hamza Mughal, Vladislav Golyanik, and Christian Theobalt. Mofusion: A framework for denoising-diffusion-based motion synthesis. InProceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023. 2

  28. [28]

    Diffusion-based co-speech gesture genera- tion using joint text and audio representation

    Anna Deichler, Shivam Mehta, Simon Alexanderson, and Jonas Beskow. Diffusion-based co-speech gesture genera- tion using joint text and audio representation. InProceedings of the International Conference on Multimodal Interaction, pages 755–762, 2023. 2

  29. [29]

    Arpad E. Elo. The proposed USCF rating system, its devel- opment, theory, and applications.Chess Life, 22(8):242–247,

  30. [30]

    See- ing is believing: body motion dominates in multisensory conversations.ACM Transactions on Graphics (TOG), 29 (4):1–9, 2010

    Cathy Ennis, Rachel McDonnell, and Carol O’Sullivan. See- ing is believing: body motion dominates in multisensory conversations.ACM Transactions on Graphics (TOG), 29 (4):1–9, 2010. 4

  31. [31]

    Investigating the use of recurrent motion modelling for speech gesture generation

    Ylva Ferstl and Rachel McDonnell. Investigating the use of recurrent motion modelling for speech gesture generation. In Proceedings of the ACM International Conference on Intel- ligent Virtual Agents, pages 93–98, 2018. 2, 3

  32. [32]

    Zeroeggs: Zero-shot example-based gesture generation from speech

    Saeed Ghorbani, Ylva Ferstl, Daniel Holden, Nikolaus F Troje, and Marc-Andr ´e Carbonneau. Zeroeggs: Zero-shot example-based gesture generation from speech. InCom- puter Graphics Forum, pages 206–216. Wiley Online Li- brary, 2023. 2, 3, 21

  33. [33]

    Factorizing text-to-video generation by explicit image conditioning

    Rohit Girdhar, Mannat Singh, Andrew Brown, Quentin Du- val, Samaneh Azadi, Sai Saketh Rambhatla, Akbar Shah, Xi Yin, Devi Parikh, and Ishan Misra. Factorizing text-to-video generation by explicit image conditioning. InProceedings of the European Conference on Computer Vision, pages 205– 224, 2024. 6

  34. [34]

    wild west

    Kazi Injamamul Haque, Alkiviadis Pavlou, and Zerrin Yu- mak. “wild west” of evaluating speech-driven 3d facial ani- mation synthesis: A benchmark study. InComputer Graph- ics Forum, page e70073. Wiley Online Library, 2025. 2

  35. [35]

    Evaluation of speech-to-gesture generation using bi-directional LSTM network

    Dai Hasegawa, Naoshi Kaneko, Shinichi Shirakawa, Hiroshi Sakuta, and Kazuhiko Sumi. Evaluation of speech-to-gesture generation using bi-directional LSTM network. InProceed- ings of the ACM International Conference on Intelligent Vir- tual Agents, pages 79–86, New York, NY , USA, 2018. ACM. 2 11

  36. [36]

    Automatic quality assessment of speech-driven synthesized gestures.International Journal of Computer Games Technology, 2022, 2022

    Zhiyuan He. Automatic quality assessment of speech-driven synthesized gestures.International Journal of Computer Games Technology, 2022, 2022. 10

  37. [37]

    The curse of performative user studies

    Aaron Hertzmann. The curse of performative user studies. IEEE Computer Graphics and Applications, 43(6):112–116,

  38. [38]

    Establishing a uni- fied evaluation framework for human motion generation: A comparative analysis of metrics.Computer Vision and Image Understanding, 254:104337, 2025

    Ali Ismail-Fawaz, Maxime Devanne, Stefano Berretti, Jonathan Weber, and Germain Forestier. Establishing a uni- fied evaluation framework for human motion generation: A comparative analysis of metrics.Computer Vision and Image Understanding, 254:104337, 2025. 10

  39. [39]

    Let’s face it: Probabilistic multi-modal interlocutor-aware generation of facial gestures in dyadic set- tings

    Patrik Jonell, Taras Kucherenko, Gustav Eje Henter, and Jonas Beskow. Let’s face it: Probabilistic multi-modal interlocutor-aware generation of facial gestures in dyadic set- tings. InProceedings of the ACM International Conference on Intelligent Virtual Agents, 2020. 4, 7

  40. [40]

    Maurice George Kendall.Rank correlation methods.Griffin,

  41. [41]

    Analyzing input and output representations for speech-driven gesture gener- ation

    Taras Kucherenko, Dai Hasegawa, Gustav Eje Henter, Naoshi Kaneko, and Hedvig Kjellstr ¨om. Analyzing input and output representations for speech-driven gesture gener- ation. InProceedings of the ACM International Conference on Intelligent Virtual Agents, pages 97–104, New York, NY , USA, 2019. ACM. 2

  42. [42]

    Gesticulator: A framework for semantically-aware speech-driven gesture generation

    Taras Kucherenko, Patrik Jonell, Sanne Van Waveren, Gustav Eje Henter, Simon Alexandersson, Iolanda Leite, and Hedvig Kjellstr ¨om. Gesticulator: A framework for semantically-aware speech-driven gesture generation. In Proceedings of the ACM International Conference on Mul- timodal Interaction, pages 242–250, 2020

  43. [43]

    Taras Kucherenko, Dai Hasegawa, Naoshi Kaneko, Gus- tav Eje Henter, and Hedvig Kjellstr ¨om. Moving fast and slow: Analysis of representations and post-processing in speech-driven automatic gesture generation.Interna- tional Journal of Human-Computer Interaction, 37(14): 1300–1316, 2021. 2

  44. [44]

    A large, crowdsourced eval- uation of gesture generation systems on common data: The genea challenge 2020

    Taras Kucherenko, Patrik Jonell, Youngwoo Yoon, Pieter Wolfert, and Gustav Eje Henter. A large, crowdsourced eval- uation of gesture generation systems on common data: The genea challenge 2020. In26th international conference on intelligent user interfaces, pages 11–21, 2021. 2, 4, 15, 20

  45. [45]

    The GENEA Challenge 2023: A large- scale evaluation of gesture generation models in monadic and dyadic settings

    Taras Kucherenko, Rajmund Nagy, Youngwoo Yoon, Jieyeon Woo, Teodor Nikolov, Mihail Tsakov, and Gus- tav Eje Henter. The GENEA Challenge 2023: A large- scale evaluation of gesture generation models in monadic and dyadic settings. InProceedings of the International Con- ference on Multimodal Interaction, pages 792–801, 2023. 4, 7

  46. [46]

    Evaluating gesture generation in a large-scale open chal- lenge: The GENEA Challenge 2022.ACM Transactions on Graphics (TOG), 2024

    Taras Kucherenko, Pieter Wolfert, Youngwoo Yoon, Carla Viegas, Teodor Nikolov, Mihail Tsakov, and Gustav Eje Hen- ter. Evaluating gesture generation in a large-scale open chal- lenge: The GENEA Challenge 2022.ACM Transactions on Graphics (TOG), 2024. 2, 4, 7, 15, 19, 23

  47. [47]

    Talking With Hands 16.2M: A large-scale dataset of synchronized body-finger motion and audio for conversational motion analysis and synthesis

    Gilwoo Lee, Zhiwei Deng, Shugao Ma, Takaaki Shiratori, Siddhartha S. Srinivasa, and Yaser Sheikh. Talking With Hands 16.2M: A large-scale dataset of synchronized body-finger motion and audio for conversational motion analysis and synthesis. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 763–772,

  48. [48]

    AI choreographer: Music conditioned 3D dance generation with AIST++

    Ruilong Li, Shan Yang, David A. Ross, and Angjoo Kanazawa. AI choreographer: Music conditioned 3D dance generation with AIST++. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13401–13412, 2021. 22

  49. [49]

    BEAT: A large-scale semantic and emotional multi-modal dataset for conversational gestures synthesis

    Haiyang Liu, Zihao Zhu, Naoya Iwamoto, Yichen Peng, Zhengqing Li, You Zhou, Elif Bozkurt, and Bo Zheng. BEAT: A large-scale semantic and emotional multi-modal dataset for conversational gestures synthesis. In Proceedings of the European Conference on Computer Vision, pages 612–630, 2022. 3, 22

  50. [50]

    Haiyang Liu, Zihao Zhu, Giorgio Becherini, Yichen Peng, Mingyang Su, You Zhou, Xuefei Zhe, Naoya Iwamoto, Bo Zheng, and Michael J. Black. EMAGE: Towards unified holistic co-speech gesture generation via expressive masked audio gesture modeling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1144–1154, 2024. 1, 3...

  51. [51]

    Semges: Semantics-aware co-speech gesture generation using semantic coherence and relevance learning

    Lanmiao Liu, Esam Ghaleb, Aslı Özyürek, and Zerrin Yumak. Semges: Semantics-aware co-speech gesture generation using semantic coherence and relevance learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025. 3

  52. [52]

    Gesturelsm: Latent shortcut based co-speech gesture generation with spatial-temporal modeling

    Pinxin Liu, Luchuan Song, Junhua Huang, and Chenliang Xu. Gesturelsm: Latent shortcut based co-speech gesture generation with spatial-temporal modeling. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025. 3

  53. [53]

    Learning hierarchical cross-modal association for co-speech gesture generation

    Xian Liu, Qianyi Wu, Hang Zhou, Yinghao Xu, Rui Qian, Xinyi Lin, Xiaowei Zhou, Wayne Wu, Bo Dai, and Bolei Zhou. Learning hierarchical cross-modal association for co-speech gesture generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10462–10472, 2022. 2, 22

  54. [54]

    Speech-based gesture generation for robots and embodied agents: A scoping review

    Yu Liu, Gelareh Mohammadi, Yang Song, and Wafa Johal. Speech-based gesture generation for robots and embodied agents: A scoping review. In Proceedings of the International Conference on Human-Agent Interaction, pages 31–38, 2021. 1

  55. [55]

    Towards variable and coordinated holistic co-speech motion generation

    Yifei Liu, Qiong Cao, Yandong Wen, Huaiguang Jiang, and Changxing Ding. Towards variable and coordinated holistic co-speech motion generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1566–1576, 2024. 3

  56. [56]

    Render me real? investigating the effect of render style on the perception of animated virtual humans

    Rachel McDonnell, Martin Breidt, and Heinrich H Bülthoff. Render me real? investigating the effect of render style on the perception of animated virtual humans. ACM Transactions on Graphics (TOG), 31(4):1–11, 2012. 5

  57. [57]

    When what you hear influences when you see: listening to an auditory rhythm influences the temporal allocation of visual attention

    Jared E. Miller, Laura A. Carlson, and J. Devin McAuley. When what you hear influences when you see: listening to an auditory rhythm influences the temporal allocation of visual attention. Psychological Science, 24(1):11–18, 2013. 9

  58. [58]

    Convofusion: Multi-modal conversational diffusion for co-speech gesture synthesis

    Muhammad Hamza Mughal, Rishabh Dabral, Ikhsanul Habibie, Lucia Donatelli, Marc Habermann, and Christian Theobalt. Convofusion: Multi-modal conversational diffusion for co-speech gesture synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024. 2, 3, 7, 8, 21

  59. [59]

    Retrieving semantics from the deep: an RAG solution for gesture synthesis

    M. Hamza Mughal, Rishabh Dabral, Merel C. J. Scholman, Vera Demberg, and Christian Theobalt. Retrieving semantics from the deep: an RAG solution for gesture synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025. 3, 4, 5, 7, 8, 17, 21

  60. [60]

    Towards a GENEA leaderboard – an extended, living benchmark for evaluating and advancing conversational motion synthesis

    Rajmund Nagy, Hendric Voss, Youngwoo Yoon, Taras Kucherenko, Teodor Nikolov, Thanh Hoang-Minh, Rachel McDonnell, Stefan Kopp, Michael Neff, and Gustav Eje Henter. Towards a GENEA leaderboard – an extended, living benchmark for evaluating and advancing conversational motion synthesis. arXiv preprint arXiv:2410.06327, 2024. 10

  61. [61]

    From audio to photoreal embodiment: Synthesizing humans in conversations

    Evonne Ng, Javier Romero, Timur Bagautdinov, Shaojie Bai, Trevor Darrell, Angjoo Kanazawa, and Alexander Richard. From audio to photoreal embodiment: Synthesizing humans in conversations. In IEEE Conference on Computer Vision and Pattern Recognition, 2024. 2, 3, 5, 18, 19, 22

  62. [62]

    A comprehensive review of data-driven co-speech gesture generation

    Simbarashe Nyatsanga, Taras Kucherenko, Chaitanya Ahuja, Gustav Eje Henter, and Michael Neff. A comprehensive review of data-driven co-speech gesture generation. In Computer Graphics Forum, pages 569–596. Wiley Online Library, 2023. 1, 2, 17

  63. [63]

    Bodyformer: Semantics-guided 3d body gesture synthesis with transformer

    Kunkun Pang, Dafei Qin, Yingruo Fan, Julian Habekost, Takaaki Shiratori, Junichi Yamagishi, and Taku Komura. Bodyformer: Semantics-guided 3d body gesture synthesis with transformer. ACM Transactions on Graphics (TOG), 42(4):1–12, 2023. 3, 5, 20

  64. [64]

    Expressive body capture: 3d hands, face, and body from a single image

    Georgios Pavlakos, Vasileios Choutas, Nima Ghorbani, Timo Bolkart, Ahmed AA Osman, Dimitrios Tzionas, and Michael J Black. Expressive body capture: 3d hands, face, and body from a single image. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10975–10985, 2019. 6

  65. [65]

    Georgios Pavlakos, Vasileios Choutas, Nima Ghorbani, Timo Bolkart, Ahmed A. A. Osman, Dimitrios Tzionas, and Michael J. Black. Expressive body capture: 3D hands, face, and body from a single image. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10975–10985, 2019. 20

  66. [66]

    The blizzard challenge 2023

    Olivier Perrotin, Brooke Stephenson, Silvain Gerber, and Gérard Bailly. The blizzard challenge 2023. In 18th Blizzard Challenge Workshop, pages 1–27. ISCA, 2023. 15

  67. [67]

    Multilevel rhythms in multimodal communication

    Wim Pouw, Shannon Proksch, Linda Drijvers, Marco Gamba, Judith Holler, Christopher Kello, Rebecca S. Schaefer, and Geraint A. Wiggins. Multilevel rhythms in multimodal communication. P. Roy. Soc. B, 376(1835), 2021. 9

  68. [68]

    Weakly-supervised emotion transition learning for diverse 3d co-speech gesture generation

    Xingqun Qi, Jiahao Pan, Peng Li, Ruibin Yuan, Xiaowei Chi, Mengfei Li, Wenhan Luo, Wei Xue, Shanghang Zhang, Qifeng Liu, and Yike Guo. Weakly-supervised emotion transition learning for diverse 3d co-speech gesture generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10424–10434, 2024. 3, 5

  69. [69]

    Passing a non-verbal Turing test: Evaluating gesture animations generated from speech

    Manuel Rebol, Christian Gütl, and Krzysztof Pietroszek. Passing a non-verbal Turing test: Evaluating gesture animations generated from speech. In Proceedings of the IEEE Conference on Virtual Reality and 3D User Interfaces, pages 573–581. IEEE, 2021. 4, 7

  70. [70]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022. 22

  71. [71]

    The importance of qualitative elements in subjective evaluation of semantic gestures

    Carolyn Saund and Stacy Marsella. The importance of qualitative elements in subjective evaluation of semantic gestures. In 2021 16th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2021), pages 1–8. IEEE,

  72. [72]

    Co-speech gesture synthesis by reinforcement learning with contrastive pre-trained rewards

    Mingyang Sun, Mengchen Zhao, Yaqing Hou, Minglei Li, Huang Xu, Songcen Xu, and Jianye Hao. Co-speech gesture synthesis by reinforcement learning with contrastive pre-trained rewards. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2331–2340, 2023. 3

  73. [73]

    Speech-to-gesture generation: A challenge in deep learning approach with bi-directional LSTM

    Kenta Takeuchi, Dai Hasegawa, Shinichi Shirakawa, Naoshi Kaneko, Hiroshi Sakuta, and Kazuhiko Sumi. Speech-to-gesture generation: A challenge in deep learning approach with bi-directional LSTM. In Proceedings of the International Conference on Human Agent Interaction, 2017. 2

  74. [74]

    Training data-efficient image transformers & distillation through attention

    Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Herve Jegou. Training data-efficient image transformers & distillation through attention. In Proceedings of the International Conference on Machine Learning, pages 10347–10357. PMLR, 2021. 22

  75. [75]

    EDGE: Editable dance generation from music

    Jonathan Tseng, Rodrigo Castellon, and Karen Liu. EDGE: Editable dance generation from music. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 448–458, 2023. 2

  76. [76]

    Aq-gt: a temporally aligned and quantized gru-transformer for co-speech gesture synthesis

    Hendric Voß and Stefan Kopp. Aq-gt: a temporally aligned and quantized gru-transformer for co-speech gesture synthesis. In Proceedings of the 25th International Conference on Multimodal Interaction, pages 60–69, 2023. 2

  77. [77]

    To rate or not to rate: Investigating evaluation methods for generated co-speech gestures

    Pieter Wolfert, Jeffrey M. Girard, Taras Kucherenko, and Tony Belpaeme. To rate or not to rate: Investigating evaluation methods for generated co-speech gestures. In Proc. ICMI, pages 494–502. ACM, 2021. 6

  78. [78]

    A review of evaluation practices of gesture generation in embodied conversational agents

    Pieter Wolfert, Nicole Robinson, and Tony Belpaeme. A review of evaluation practices of gesture generation in embodied conversational agents. IEEE Transactions on Human-Machine Systems, 52(3):379–389, 2022. 2

  79. [79]

    Probabilistic speech-driven 3d facial motion synthesis: new benchmarks methods and applications

    Karren D Yang, Anurag Ranjan, Jen-Hao Rick Chang, Raviteja Vemulapalli, and Oncel Tuzel. Probabilistic speech-driven 3d facial motion synthesis: new benchmarks methods and applications. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 27294–27303, 2024. 2

  80. [80]

    Gesturehydra: Semantic co-speech gesture synthesis via hybrid modality diffusion transformer and cascaded-synchronized retrieval-augmented generation

    Quanwei Yang, Luying Huang, Kaisiyuan Wang, Jiazhi Guan, Shengyi He, Fengguo Li, Lingyun Yu, Yingying Li, Haocheng Feng, Hang Zhou, and Hongtao Xie. Gesturehydra: Semantic co-speech gesture synthesis via hybrid modality diffusion transformer and cascaded-synchronized retrieval-augmented generation. In Proceedings of the IEEE/CVF International Conferen...

Showing first 80 references.