Towards Reliable Human Evaluations in Gesture Generation: Insights from a Community-Driven State-of-the-Art Benchmark
Pith reviewed 2026-05-18 01:36 UTC · model grok-4.3
The pith
Standardized human evaluations show that motion realism has saturated for gesture-generation models on the BEAT2 dataset, while earlier claims of strong speech-gesture alignment fail to hold.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Applying the new protocol across six author-trained models on BEAT2 reveals that motion realism has become saturated, with older models matching recent ones, while prior reports of strong speech-gesture alignment do not survive rigorous pairwise testing; therefore accurate progress requires separate measurement of motion quality and multimodal alignment rather than combined scores.
What carries the argument
The crowdsourced human evaluation protocol that disentangles motion realism from speech-gesture alignment through large-scale pairwise preference votes on rendered video stimuli from the BEAT2 dataset.
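One common way to turn pairwise preference votes into a ranking is a Bradley-Terry model. The sketch below is illustrative only: the system names and vote counts are invented, the function name `bradley_terry` is hypothetical, and this summary does not state which aggregation procedure the paper itself applies to its votes.

```python
from collections import defaultdict

def bradley_terry(wins, n_iter=200, tol=1e-9):
    """Fit Bradley-Terry strengths with the classic MM (Zermelo) iteration.

    wins[(a, b)] is the number of votes preferring system a over system b.
    Assumes every system wins at least one vote. Returns strengths that sum to 1.
    """
    systems = sorted({s for pair in wins for s in pair})
    total_wins = defaultdict(float)
    games = defaultdict(float)  # games[(a, b)]: comparisons between a and b
    for (a, b), w in wins.items():
        total_wins[a] += w
        games[(a, b)] += w
        games[(b, a)] += w
    p = {s: 1.0 / len(systems) for s in systems}
    for _ in range(n_iter):
        new_p = {}
        for i in systems:
            denom = sum(games[(i, j)] / (p[i] + p[j]) for j in systems if j != i)
            new_p[i] = total_wins[i] / denom if denom > 0 else p[i]
        norm = sum(new_p.values())
        new_p = {s: v / norm for s, v in new_p.items()}
        delta = max(abs(new_p[s] - p[s]) for s in systems)
        p = new_p
        if delta < tol:
            break
    return p

# Hypothetical vote counts for three systems (not data from the paper).
votes = {
    ("A", "B"): 60, ("B", "A"): 40,
    ("A", "C"): 70, ("C", "A"): 30,
    ("B", "C"): 55, ("C", "B"): 45,
}
strengths = bradley_terry(votes)
ranking = sorted(strengths, key=strengths.get, reverse=True)
```

Running the fit separately on realism votes and on alignment votes would give two independent rankings, which is the disentangling the protocol aims for.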
If this is right
- Motion realism can no longer serve as a useful benchmark on BEAT2 because older and newer models perform on par.
- Claims of high speech-gesture alignment from earlier work do not replicate under controlled conditions even for models designed for alignment.
- Benchmarking must separate motion quality from multimodal alignment to avoid misleading combined scores.
- The released five hours of synthetic motion and 750+ video stimuli enable new studies without requiring model reimplementation.
Where Pith is reading between the lines
- Future model development could shift focus toward improving alignment while preserving the already high realism baseline.
- The released preference votes and rendering script create a reusable testbed that other multimodal generation fields might adapt for their own evaluation standards.
- If the saturation finding generalizes, research resources may move away from pure realism metrics toward timing, semantics, or style control in gestures.
Load-bearing premise
The crowdsourced protocol itself introduces no new biases from the participant pool or platform that would distort the rankings of realism and alignment.
What would settle it
A replication of the same pairwise votes using a different crowdsourcing platform or screened participant group that produces substantially different model rankings or restores high alignment scores for specialized models.
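A replication of this kind needs some way to quantify how far two model rankings diverge. Kendall's tau is one standard choice. The sketch below assumes rankings without ties and uses made-up model labels (`M1` to `M6`); it is not taken from the paper.

```python
from itertools import combinations

def kendall_tau(rank_a, rank_b):
    """Kendall's tau-a between two rankings of the same items (no ties).

    rank_a and rank_b are lists ordered from best to worst.
    Returns 1.0 for identical orderings and -1.0 for fully reversed ones.
    """
    if set(rank_a) != set(rank_b) or len(set(rank_a)) != len(rank_a):
        raise ValueError("rankings must contain the same unique items")
    pos_a = {item: i for i, item in enumerate(rank_a)}
    pos_b = {item: i for i, item in enumerate(rank_b)}
    concordant = discordant = 0
    for x, y in combinations(rank_a, 2):
        # A pair is concordant when both rankings order it the same way.
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) > 0:
            concordant += 1
        else:
            discordant += 1
    n_pairs = len(rank_a) * (len(rank_a) - 1) // 2
    return (concordant - discordant) / n_pairs

# Hypothetical rankings from an original study and a replication.
original = ["M1", "M2", "M3", "M4", "M5", "M6"]
replication = ["M2", "M1", "M3", "M4", "M6", "M5"]
tau = kendall_tau(original, replication)
```

A tau close to 1 would indicate that the replication preserves the ordering. Where models are statistically tied, as the saturation finding suggests for realism, swaps among near-equal models are expected and a low tau alone would not refute the result.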
Original abstract
We review human evaluation practices in automatic, speech-driven 3D gesture generation and find a lack of standardisation and frequent use of flawed experimental setups. This leads to a situation where it is impossible to know how different methods compare, or what the state of the art is. In order to address common shortcomings of evaluation design, and to standardise future user studies in gesture-generation works, we introduce a detailed human evaluation protocol for the widely-used BEAT2 motion-capture dataset. Using this protocol, we conduct large-scale crowdsourced evaluation to rank six recent gesture-generation models -- each trained by its original authors -- across two key evaluation dimensions: motion realism and speech-gesture alignment. Our results show that 1) motion realism has become a saturated evaluation measure on the BEAT2 dataset, with older models performing on par with more recent approaches; 2) previous findings of high speech-gesture alignment do not hold up under rigorous evaluation, even for specialised models; and 3) the field must adopt disentangled assessments of motion quality and multimodal alignment for accurate benchmarking in order to make progress. To drive standardisation and enable new evaluation research, we release five hours of synthetic motion from the benchmarked models; over 750 rendered video stimuli from the user studies -- enabling new evaluations without requiring model reimplementation -- alongside our open-source rendering script, and 16,000 pairwise human preference votes collected for our benchmark.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript reviews human evaluation practices in automatic speech-driven 3D gesture generation, identifying a lack of standardization and frequent use of flawed experimental setups that prevent reliable comparisons or identification of the state of the art. To address these issues, the authors introduce a detailed human evaluation protocol for the BEAT2 motion-capture dataset. They apply this protocol in a large-scale crowdsourced study ranking six recent gesture-generation models (each trained by its original authors) on two dimensions: motion realism and speech-gesture alignment. Results indicate that motion realism has become saturated on BEAT2 (older models perform on par with recent ones), that prior findings of high speech-gesture alignment do not replicate under rigorous evaluation, and that the field should adopt disentangled assessments of motion quality and multimodal alignment. The authors release five hours of synthetic motion, over 750 rendered video stimuli, an open-source rendering script, and 16,000 pairwise human preference votes to support standardization and future research.
Significance. If the protocol proves robust, the work could meaningfully advance benchmarking standards in gesture generation by providing a reproducible protocol and releasing extensive resources (model outputs, video stimuli, rendering code, and a large set of human votes). These releases are a clear strength for reproducibility and enable new evaluations without model reimplementation. The findings on saturation and alignment replication could prompt the community to move beyond saturated or confounded metrics, though this depends on the protocol being documented in enough detail to rule out bias in the crowdsourced rankings.
major comments (2)
- §4 (Evaluation Protocol): The manuscript provides limited detail on participant filtering, attention checks, demographic controls, and any calibration against expert or lab-based raters. Because the headline claims of motion realism saturation on BEAT2 and non-replication of prior alignment results rest on the crowdsourced protocol producing unbiased rankings, these aspects require fuller specification to substantiate the conclusions.
- §5 (Results): The evidence for saturation (older models performing on par with recent ones) and the alignment findings should include explicit statistical tests, p-values, effect sizes, or confidence intervals in the relevant tables or figures. Without these, it is difficult to assess whether the observed parity or differences are statistically meaningful or merely due to variance in the crowdsourced data.
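For a single pairwise comparison, the statistical support requested here could look like the following sketch: an exact two-sided binomial test against a 50% preference rate, plus a Wilson 95% confidence interval. The vote counts are invented, and the sketch treats votes as independent, which ignores that one participant usually casts several votes. The direct probability sum also only suits moderate vote counts; for thousands of votes a log-space or library implementation would be needed.

```python
from math import comb, sqrt

def binomial_two_sided_p(wins, n, p0=0.5):
    """Exact two-sided binomial p-value.

    Sums the probability of every outcome that is no more likely
    than the observed one under the null preference rate p0.
    """
    pmf = [comb(n, k) * p0**k * (1 - p0) ** (n - k) for k in range(n + 1)]
    observed = pmf[wins]
    # Small relative tolerance guards against floating-point ties.
    return min(1.0, sum(q for q in pmf if q <= observed * (1 + 1e-9)))

def wilson_interval(wins, n, z=1.959964):
    """Wilson score interval for a proportion (z defaults to the 95% level)."""
    phat = wins / n
    denom = 1 + z**2 / n
    centre = (phat + z**2 / (2 * n)) / denom
    half = z * sqrt(phat * (1 - phat) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half

# Hypothetical comparison: a newer model preferred in 104 of 200 votes.
wins, n = 104, 200
p_value = binomial_two_sided_p(wins, n)
low, high = wilson_interval(wins, n)
```

With these made-up counts the interval straddles 0.5 and the p-value is far above 0.05, which is the pattern a "parity" claim would need to show. Failing to reject the null is not proof of equivalence, so a narrow interval around 0.5 is the more informative evidence.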
minor comments (3)
- Figure captions should explicitly describe what error bars represent (e.g., standard error or 95% CI) and clarify the exact comparison being shown in each panel.
- Notation for the two evaluation dimensions (motion realism vs. speech-gesture alignment) should be used consistently throughout the text and figures to avoid ambiguity.
- [Related Work] The related-work section would benefit from citing any very recent (post-2023) gesture-generation papers that also discuss evaluation practices.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback. We address each major comment below and have revised the manuscript to incorporate additional details and analyses where feasible, thereby strengthening the transparency and statistical rigor of our work.
- Referee: §4 (Evaluation Protocol): The manuscript provides limited detail on participant filtering, attention checks, demographic controls, and any calibration against expert or lab-based raters. Because the headline claims of motion realism saturation on BEAT2 and non-replication of prior alignment results rest on the crowdsourced protocol producing unbiased rankings, these aspects require fuller specification to substantiate the conclusions.
  Authors: We appreciate the referee's emphasis on rigorous documentation of the crowdsourcing procedure. Section 4 of the manuscript already describes the use of attention checks, basic participant filtering based on response quality, and collection of demographic information as part of the protocol for the BEAT2 dataset. To address this comment directly, we will expand the section with more granular specifications of the filtering thresholds, attention check design, and demographic breakdowns. Regarding calibration against expert or lab-based raters, the study did not include a direct comparison; we followed common practices for large-scale crowdsourced evaluations in generative modeling. In the revision we will add an explicit discussion of this design choice, its alignment with prior literature, and any associated limitations.
  Revision: partial
- Referee: §5 (Results): The evidence for saturation (older models performing on par with recent ones) and the alignment findings should include explicit statistical tests, p-values, effect sizes, or confidence intervals in the relevant tables or figures. Without these, it is difficult to assess whether the observed parity or differences are statistically meaningful or merely due to variance in the crowdsourced data.
  Authors: We agree that explicit statistical support is important for interpreting the saturation and alignment results. The manuscript currently presents preference rankings and percentages from the 16,000 votes. In the revised version we will augment the relevant tables and figures in Section 5 with appropriate non-parametric statistical tests (e.g., Friedman test followed by post-hoc Wilcoxon signed-rank tests with correction), p-values, effect sizes (rank-biserial correlation), and 95% confidence intervals for the key comparisons. These additions will be computed from the existing preference data and will clarify the statistical meaningfulness of the observed model parity in motion realism and the alignment findings.
  Revision: yes
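The "with correction" step in this plan refers to a multiple-comparison adjustment across the post-hoc tests. The rebuttal does not say which adjustment will be used; Holm-Bonferroni is one common option and is sketched below as an assumption. The raw p-values are invented for illustration.

```python
def holm_adjust(p_values):
    """Return Holm-Bonferroni adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The k-th smallest p-value (k = rank + 1) is scaled by (m - k + 1).
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Adjusted values must not decrease as the raw p-values grow.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted

# Hypothetical raw p-values from four post-hoc pairwise tests.
raw = [0.001, 0.04, 0.03, 0.20]
adjusted = holm_adjust(raw)
significant = [p < 0.05 for p in adjusted]
```

In this made-up example two comparisons that look significant at 0.05 before adjustment (0.04 and 0.03) no longer are afterwards, which shows why the correction matters when six models yield fifteen pairwise comparisons per evaluation dimension.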
Circularity Check
New crowdsourced human preference data yields independent benchmark results
full rationale
The paper reviews prior evaluation practices, introduces a detailed protocol for the BEAT2 dataset, and reports results from a fresh large-scale crowdsourced study collecting 16,000 pairwise votes on motion realism and speech-gesture alignment for six models. The headline claims (saturation of realism, failure of prior alignment findings to replicate, need for disentangled metrics) are direct empirical outcomes of this new data collection and protocol application, not reductions of fitted parameters, self-definitions, or self-citation chains. The work is self-contained against external benchmarks via released stimuli and votes; no load-bearing step equates to its own inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Human judgments in crowdsourced pairwise comparisons reliably reflect perceived motion realism and speech-gesture alignment.
Forward citations
Cited by 3 Pith papers
- ViBES: A Conversational Agent with Behaviorally-Intelligent 3D Virtual Body
  ViBES introduces a speech-language-behavior model using modality-specific transformer experts that jointly generates dialogue and 3D body actions, showing gains over separate co-speech and text-to-motion baselines on ...
- Reality Check: How Avatar and Face Representation Affect the Perceptual Evaluation of Synthesized Gestures
  Avatar and face representation systematically shift perceptual judgments of synthesized co-speech gestures.
- Reality Check: How Avatar and Face Representation Affect the Perceptual Evaluation of Synthesized Gestures
  Avatar appearance and facial presentation systematically bias perceptual judgments of synthesized co-speech gestures.
Reference graph
Works this paper leans on
-
[1]
Mohammad Mahdi Abootorabi, Omid Ghahroodi, Par- dis Sadat Zahraei, Hossein Behzadasl, Alireza Mirrokni, Mobina Salimipanah, Arash Rasouli, Bahar Behzadipour, Sara Azarnoush, Benyamin Maleki, et al. Generative AI for character animation: A comprehensive survey of tech- niques, applications, and future directions.arXiv preprint arXiv:2504.19056, 2025. 1
-
[2]
No gestures left behind: Learning rela- tionships between spoken language and freeform gestures
Chaitanya Ahuja, Dong Won Lee, Ryo Ishii, and Louis- Philippe Morency. No gestures left behind: Learning rela- tionships between spoken language and freeform gestures. InProceedings of the 2020 Conference on Empirical Meth- ods in Natural Language Processing: Findings, pages 1884– 1895, 2020. 3
work page 2020
-
[3]
Continual learning for personalized co-speech gesture generation
Chaitanya Ahuja, Pratik Joshi, Ryo Ishii, and Louis-Philippe Morency. Continual learning for personalized co-speech gesture generation. InProceedings of the IEEE/CVF In- ternational Conference on Computer Vision (ICCV), pages 20893–20903, 2023. 3
work page 2023
-
[4]
Style-controllable speech-driven gesture synthesis using normalising flows
Simon Alexanderson, Gustav Eje Henter, Taras Kucherenko, and Jonas Beskow. Style-controllable speech-driven gesture synthesis using normalising flows. InComputer Graphics Forum, pages 487–496. Wiley Online Library, 2020. 2
work page 2020
-
[5]
Simon Alexanderson, Rajmund Nagy, Jonas Beskow, and Gustav Eje Henter. Listen, denoise, action! audio-driven motion synthesis with diffusion models.ACM Transactions on Graphics (TOG), 42(4):1–20, 2023. 2, 3, 5, 22
work page 2023
-
[6]
Tenglong Ao, Zeyi Zhang, and Libin Liu. GestureDiffu- CLIP: Gesture diffusion model with CLIP latents.ACM Transactions on Graphics (TOG), 42(4):1–18, 2023. 2, 3, 5
work page 2023
-
[7]
Kenneth J. Arrow. A difficulty in the concept of social wel- fare.Journal of Political Economy, 58(4):328–346, 1950. 9
work page 1950
-
[8]
Why spiderman is such a good dancer.https : / / web
Jody Avirgan. Why spiderman is such a good dancer.https : / / web . archive . org / web / 20201112011116/https://www.wnycstudios. org / podcasts / radiolab / articles / 299399 - why-spiderman-such-good-dancer, 2013. 9
work page 2013
-
[9]
Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli. wav2vec 2.0: A framework for self-supervised learning of speech representations.Advances in neural infor- mation processing systems, 33:12449–12460, 2020. 21
work page 2020
-
[10]
Yoav Benjamini and Yosef Hochberg. On the adaptive con- trol of the false discovery rate in multiple testing with inde- 10 pendent statistics.J. Educ. Behav. Stat., 25(1):60–83, 2000. 17
work page 2000
-
[11]
Meriem Boubdir, Edward Kim, Beyza Ermis, Sara Hooker, and Marzieh Fadaee. Elo uncovered: Robustness and best practices in language model evaluation.Advances in Neural Information Processing Systems, 37:106135–106161, 2024. 9
work page 2024
-
[12]
Ralph Allan Bradley and Milton E. Terry. Rank analysis of incomplete block designs: I. the method of paired compar- isons.Biometrika, 39(3/4):324–345, 1952. 6, 16
work page 1952
-
[13]
Zoya Bylinskii, Laura Herman, Aaron Hertzmann, Stefanie Hutka, Yile Zhang, et al. Towards better user studies in com- puter graphics and vision.Foundations and Trends® in Com- puter Graphics and Vision, 15(3):201–252, 2023. 1
work page 2023
-
[14]
A V-Flow: Transforming text to audio-visual human-like interactions.arXiv preprint arXiv:2502.13133,
Aggelina Chatziagapi, Louis-Philippe Morency, Hongyu Gong, Michael Zollh ¨ofer, Dimitris Samaras, and Alexan- der Richard. A V-Flow: Transforming text to audio-visual human-like interactions.arXiv preprint arXiv:2502.13133,
-
[15]
Motion-example-controlled co-speech ges- ture generation leveraging large language models
Bohong Chen, Yumeng Li, Youyi Zheng, Yao-Xiang Ding, and Kun Zhou. Motion-example-controlled co-speech ges- ture generation leveraging large language models. InPro- ceedings of the Special Interest Group on Computer Graph- ics and Interactive Techniques Conference Conference Pa- pers, New York, NY , USA, 2025. Association for Computing Machinery. 3
work page 2025
-
[16]
The language of motion: Unifying verbal and non-verbal language of 3d human motion
Changan Chen, Juze Zhang, Shrinidhi Kowshika Laksh- mikanth, Yusu Fang, Ruizhi Shao, Gordon Wetzstein, Li Fei- Fei, and Ehsan Adeli. The language of motion: Unifying verbal and non-verbal language of 3d human motion. InPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025. 3
work page 2025
-
[17]
Junming Chen, Yunfei Liu, Jianan Wang, Ailing Zeng, Yu Li, and Qifeng Chen. Diffsheg: A diffusion-based approach for real-time speech-driven holistic 3d expression and ges- ture generation. InProceedings of the IEEE/CVF Confer- ence on Computer Vision and Pattern Recognition, 2024. 3, 5
work page 2024
-
[18]
Hop: Heterogeneous topology-based mul- timodal entanglement for co-speech gesture generation
Hongye Cheng, Tianyu Wang, Guangsi Shi, Zexing Zhao, and Yanwei Fu. Hop: Heterogeneous topology-based mul- timodal entanglement for co-speech gesture generation. In Proceedings of the IEEE/CVF Conference on Computer Vi- sion and Pattern Recognition, 2025. 3, 5
work page 2025
-
[19]
Qingrong Cheng, Xu Li, and Xinghui Fu. Siggesture: Gen- eralized co-speech gesture synthesis via semantic injection with large-scale pre-training diffusion models. InSIG- GRAPH Asia 2024 Conference Papers, New York, NY , USA,
work page 2024
-
[20]
Association for Computing Machinery. 3
-
[21]
Yongkang Cheng and Shaoli Huang. HoloGest: Decoupled diffusion and motion priors for generating holisticly expres- sive co-speech gestures. InProceedings of the International Conference on 3D Vision, 2025. 7, 8, 22
work page 2025
-
[22]
Kiran Chhatre, Radek Dan ˇeˇcek, Nikos Athanasiou, Giorgio Becherini, Christopher Peters, Michael J. Black, and Timo Bolkart. AMUSE: Emotional speech-driven 3D body ani- mation via disentangled latent diffusion. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1942–1953, 2024. 2, 3, 5, 7, 20, 22
work page 1942
-
[23]
Wei-Lin Chiang, Tim Li, Joseph E. Gonzalez, and Ion Stoica. Chatbot Arena: New models & Elo system up- date.https://lmsys.org/blog/2023- 12- 07- leaderboard/, 2023. Accessed: 2025-05-20. 16
work page 2023
-
[24]
Effectively unbiased FID and Inception score and where to find them
Min Jin Chong and David Forsyth. Effectively unbiased FID and Inception score and where to find them. InProceedings of the IEEE/CVF Conference on Computer Vision and Pat- tern Recognition, pages 6070–6079, 2020. 23
work page 2020
-
[25]
Investigating range- equalizing bias in mean opinion score ratings of synthesized speech
Erica Cooper and Junichi Yamagishi. Investigating range- equalizing bias in mean opinion score ratings of synthesized speech. InProc. Interspeech, pages 1104–1108, 2023. 4
work page 2023
-
[26]
Karlo Crnek, Grega Mo ˇcnik, and Matej Rojc. Advancing objective evaluation of speech-driven gesture generation for embodied conversational agents.International Journal of Human–Computer Interaction, 0(0):1–17, 2025. 2, 10
work page 2025
-
[27]
Mofusion: A framework for denoising-diffusion-based motion synthesis
Rishabh Dabral, Muhammad Hamza Mughal, Vladislav Golyanik, and Christian Theobalt. Mofusion: A framework for denoising-diffusion-based motion synthesis. InProceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023. 2
work page 2023
-
[28]
Diffusion-based co-speech gesture genera- tion using joint text and audio representation
Anna Deichler, Shivam Mehta, Simon Alexanderson, and Jonas Beskow. Diffusion-based co-speech gesture genera- tion using joint text and audio representation. InProceedings of the International Conference on Multimodal Interaction, pages 755–762, 2023. 2
work page 2023
-
[29]
Arpad E. Elo. The proposed USCF rating system, its devel- opment, theory, and applications.Chess Life, 22(8):242–247,
-
[30]
Cathy Ennis, Rachel McDonnell, and Carol O’Sullivan. See- ing is believing: body motion dominates in multisensory conversations.ACM Transactions on Graphics (TOG), 29 (4):1–9, 2010. 4
work page 2010
-
[31]
Investigating the use of recurrent motion modelling for speech gesture generation
Ylva Ferstl and Rachel McDonnell. Investigating the use of recurrent motion modelling for speech gesture generation. In Proceedings of the ACM International Conference on Intel- ligent Virtual Agents, pages 93–98, 2018. 2, 3
work page 2018
-
[32]
Zeroeggs: Zero-shot example-based gesture generation from speech
Saeed Ghorbani, Ylva Ferstl, Daniel Holden, Nikolaus F Troje, and Marc-Andr ´e Carbonneau. Zeroeggs: Zero-shot example-based gesture generation from speech. InCom- puter Graphics Forum, pages 206–216. Wiley Online Li- brary, 2023. 2, 3, 21
work page 2023
-
[33]
Factorizing text-to-video generation by explicit image conditioning
Rohit Girdhar, Mannat Singh, Andrew Brown, Quentin Du- val, Samaneh Azadi, Sai Saketh Rambhatla, Akbar Shah, Xi Yin, Devi Parikh, and Ishan Misra. Factorizing text-to-video generation by explicit image conditioning. InProceedings of the European Conference on Computer Vision, pages 205– 224, 2024. 6
work page 2024
- [34]
-
[35]
Evaluation of speech-to-gesture generation using bi-directional LSTM network
Dai Hasegawa, Naoshi Kaneko, Shinichi Shirakawa, Hiroshi Sakuta, and Kazuhiko Sumi. Evaluation of speech-to-gesture generation using bi-directional LSTM network. InProceed- ings of the ACM International Conference on Intelligent Vir- tual Agents, pages 79–86, New York, NY , USA, 2018. ACM. 2 11
work page 2018
-
[36]
Zhiyuan He. Automatic quality assessment of speech-driven synthesized gestures.International Journal of Computer Games Technology, 2022, 2022. 10
work page 2022
-
[37]
The curse of performative user studies
Aaron Hertzmann. The curse of performative user studies. IEEE Computer Graphics and Applications, 43(6):112–116,
-
[38]
Ali Ismail-Fawaz, Maxime Devanne, Stefano Berretti, Jonathan Weber, and Germain Forestier. Establishing a uni- fied evaluation framework for human motion generation: A comparative analysis of metrics.Computer Vision and Image Understanding, 254:104337, 2025. 10
work page 2025
-
[39]
Patrik Jonell, Taras Kucherenko, Gustav Eje Henter, and Jonas Beskow. Let’s face it: Probabilistic multi-modal interlocutor-aware generation of facial gestures in dyadic set- tings. InProceedings of the ACM International Conference on Intelligent Virtual Agents, 2020. 4, 7
work page 2020
-
[40]
Maurice George Kendall.Rank correlation methods.Griffin,
-
[41]
Analyzing input and output representations for speech-driven gesture gener- ation
Taras Kucherenko, Dai Hasegawa, Gustav Eje Henter, Naoshi Kaneko, and Hedvig Kjellstr ¨om. Analyzing input and output representations for speech-driven gesture gener- ation. InProceedings of the ACM International Conference on Intelligent Virtual Agents, pages 97–104, New York, NY , USA, 2019. ACM. 2
work page 2019
-
[42]
Gesticulator: A framework for semantically-aware speech-driven gesture generation
Taras Kucherenko, Patrik Jonell, Sanne Van Waveren, Gustav Eje Henter, Simon Alexandersson, Iolanda Leite, and Hedvig Kjellstr ¨om. Gesticulator: A framework for semantically-aware speech-driven gesture generation. In Proceedings of the ACM International Conference on Mul- timodal Interaction, pages 242–250, 2020
work page 2020
-
[43]
Taras Kucherenko, Dai Hasegawa, Naoshi Kaneko, Gus- tav Eje Henter, and Hedvig Kjellstr ¨om. Moving fast and slow: Analysis of representations and post-processing in speech-driven automatic gesture generation.Interna- tional Journal of Human-Computer Interaction, 37(14): 1300–1316, 2021. 2
work page 2021
-
[44]
Taras Kucherenko, Patrik Jonell, Youngwoo Yoon, Pieter Wolfert, and Gustav Eje Henter. A large, crowdsourced eval- uation of gesture generation systems on common data: The genea challenge 2020. In26th international conference on intelligent user interfaces, pages 11–21, 2021. 2, 4, 15, 20
work page 2020
-
[45]
Taras Kucherenko, Rajmund Nagy, Youngwoo Yoon, Jieyeon Woo, Teodor Nikolov, Mihail Tsakov, and Gus- tav Eje Henter. The GENEA Challenge 2023: A large- scale evaluation of gesture generation models in monadic and dyadic settings. InProceedings of the International Con- ference on Multimodal Interaction, pages 792–801, 2023. 4, 7
work page 2023
-
[46]
Taras Kucherenko, Pieter Wolfert, Youngwoo Yoon, Carla Viegas, Teodor Nikolov, Mihail Tsakov, and Gustav Eje Hen- ter. Evaluating gesture generation in a large-scale open chal- lenge: The GENEA Challenge 2022.ACM Transactions on Graphics (TOG), 2024. 2, 4, 7, 15, 19, 23
work page 2022
-
[47]
Gilwoo Lee, Zhiwei Deng, Shugao Ma, Takaaki Shiratori, Siddhartha S. Srinivasa, and Yaser Sheikh. Talking With Hands 16.2 M: A large-scale dataset of synchronized body- finger motion and audio for conversational motion analy- sis and synthesis. InProceedings of the IEEE/CVF Inter- national Conference on Computer Vision, pages 763–772,
-
[48]
Ruilong Li, Shan Yang, David A. Ross, and Angjoo Kanazawa. AI choreographer: Music conditioned 3D dance generation with AIST++. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13401–13412, 2021. 22
work page 2021
-
[49]
BEAT: A large-scale semantic and emotional multi-modal dataset for conversational gestures synthesis
Haiyang Liu, Zihao Zhu, Naoya Iwamoto, Yichen Peng, Zhengqing Li, You Zhou, Elif Bozkurt, and Bo Zheng. BEAT: A large-scale semantic and emotional multi-modal dataset for conversational gestures synthesis. InProceedings of the European Conference on Computer Vision, pages 612– 630, 2022. 3, 22
work page 2022
-
[50]
Haiyang Liu, Zihao Zhu, Giorgio Becherini, Yichen Peng, Mingyang Su, You Zhou, Xuefei Zhe, Naoya Iwamoto, Bo Zheng, and Michael J. Black. EMAGE: Towards unified holistic co-speech gesture generation via expressive masked audio gesture modeling. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1144–1154, 2024. 1, 3...
work page 2024
-
[51]
Lanmiao Liu, Esam Ghaleb, Aslı ¨Ozy¨urek, and Zerrin Yu- mak. Semges: Semantics-aware co-speech gesture gener- ation using semantic coherence and relevance learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025. 3
work page 2025
-
[52]
Gesturelsm: Latent shortcut based co-speech gesture generation with spatial-temporal modeling
Pinxin Liu, Luchuan Song, Junhua Huang, and Chenliang Xu. Gesturelsm: Latent shortcut based co-speech gesture generation with spatial-temporal modeling. InProceedings of the IEEE/CVF International Conference on Computer Vi- sion, 2025. 3
work page 2025
-
[53]
Learning hierarchical cross-modal association for co- speech gesture generation
Xian Liu, Qianyi Wu, Hang Zhou, Yinghao Xu, Rui Qian, Xinyi Lin, Xiaowei Zhou, Wayne Wu, Bo Dai, and Bolei Zhou. Learning hierarchical cross-modal association for co- speech gesture generation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10462–10472, 2022. 2, 22
work page 2022
-
[54]
Speech-based gesture generation for robots and embodied agents: A scoping review
Yu Liu, Gelareh Mohammadi, Yang Song, and Wafa Johal. Speech-based gesture generation for robots and embodied agents: A scoping review. InProceedings of the Interna- tional Conference on Human-Agent Interaction, pages 31– 38, 2021. 1
work page 2021
-
[55]
Towards variable and coordinated holistic co-speech motion generation
Yifei Liu, Qiong Cao, Yandong Wen, Huaiguang Jiang, and Changxing Ding. Towards variable and coordinated holistic co-speech motion generation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1566–1576, 2024. 3
work page 2024
-
[56]
Rachel McDonnell, Martin Breidt, and Heinrich H B ¨ulthoff. Render me real? investigating the effect of render style on the perception of animated virtual humans.ACM Transac- tions on Graphics (TOG), 31(4):1–11, 2012. 5
work page 2012
-
[57]
Jared E. Miller, Laura A. Carlson, and J. Devin McAuley. When what you hear influences when you see: listening to an auditory rhythm influences the temporal allocation of visual attention.Psychological Science, 24(1):11–18, 2013. 9
work page 2013
-
[58]
Convofusion: Multi-modal conversational dif- fusion for co-speech gesture synthesis
Muhammad Hamza Mughal, Rishabh Dabral, Ikhsanul Habibie, Lucia Donatelli, Marc Habermann, and Christian 12 Theobalt. Convofusion: Multi-modal conversational dif- fusion for co-speech gesture synthesis. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024. 2, 3, 7, 8, 21
work page 2024
-
[59]
Hamza Mughal, Rishabh Dabral, Merel C
M. Hamza Mughal, Rishabh Dabral, Merel C. J. Scholman, Vera Demberg, and Christian Theobalt. Retrieving semantics from the deep: an rag solution for gesture synthesis. InPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025. 3, 4, 5, 7, 8, 17, 21
work page 2025
-
[60]
Rajmund Nagy, Hendric V oss, Youngwoo Yoon, Taras Kucherenko, Teodor Nikolov, Thanh Hoang-Minh, Rachel McDonnell, Stefan Kopp, Michael Neff, and Gustav Eje Henter. Towards a genea leaderboard–an extended, living benchmark for evaluating and advancing conversational mo- tion synthesis.arXiv preprint arXiv:2410.06327, 2024. 10
-
[61]
From audio to photoreal embodiment: Synthesizing humans in conversations
Evonne Ng, Javier Romero, Timur Bagautdinov, Shaojie Bai, Trevor Darrell, Angjoo Kanazawa, and Alexander Richard. From audio to photoreal embodiment: Synthesizing humans in conversations. InIEEE Conference on Computer Vision and Pattern Recognition, 2024. 2, 3, 5, 18, 19, 22
work page 2024
-
[62]
A comprehensive re- view of data-driven co-speech gesture generation
Simbarashe Nyatsanga, Taras Kucherenko, Chaitanya Ahuja, Gustav Eje Henter, and Michael Neff. A comprehensive re- view of data-driven co-speech gesture generation. InCom- puter Graphics Forum, pages 569–596. Wiley Online Li- brary, 2023. 1, 2, 17
work page 2023
-
[63]
Kunkun Pang, Dafei Qin, Yingruo Fan, Julian Habekost, Takaaki Shiratori, Junichi Yamagishi, and Taku Komura. Bodyformer: Semantics-guided 3d body gesture synthesis with transformer.ACM Transactions on Graphics (TOG), 42(4):1–12, 2023. 3, 5, 20
work page 2023
-
[64]
Expressive body capture: 3d hands, face, and body from a single image
Georgios Pavlakos, Vasileios Choutas, Nima Ghorbani, Timo Bolkart, Ahmed AA Osman, Dimitrios Tzionas, and Michael J Black. Expressive body capture: 3d hands, face, and body from a single image. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10975–10985, 2019. 6
work page 2019
-
[65]
Georgios Pavlakos, Vasileios Choutas, Nima Ghorbani, Timo Bolkart, Ahmed A. A. Osman, Dimitrios Tzionas, and Michael J. Black. Expressive body capture: 3D hands, face, and body from a single image. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10975–10985, 2019. 20
work page 2019
-
[66]
Olivier Perrotin, Brooke Stephenson, Silvain Gerber, and G´erard Bailly. The blizzard challenge 2023. In18th Bliz- zard Challenge Workshop, pages 1–27. ISCA, 2023. 15
work page 2023
-
[67]
Wim Pouw, Shannon Proksch, Linda Drijvers, Marco Gamba, Judith Holler, Christopher Kello, Rebecca S. Schae- fer, and Geraint A. Wiggins. Multilevel rhythms in multi- modal communication.P . Roy. Soc. B, 376(1835), 2021. 9
work page 2021
-
[68]
Weakly-supervised emotion tran- sition learning for diverse 3d co-speech gesture generation
Xingqun Qi, Jiahao Pan, Peng Li, Ruibin Yuan, Xiaowei Chi, Mengfei Li, Wenhan Luo, Wei Xue, Shanghang Zhang, Qifeng Liu, and Yike Guo. Weakly-supervised emotion tran- sition learning for diverse 3d co-speech gesture generation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10424–10434, 2024. 3, 5
[69]
Manuel Rebol, Christian Güti, and Krzysztof Pietroszek. Passing a non-verbal Turing test: Evaluating gesture animations generated from speech. In Proceedings of the IEEE Conference on Virtual Reality and 3D User Interfaces, pages 573–581. IEEE, 2021. 4, 7
[70]
Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022. 22
[71]
Carolyn Saund and Stacy Marsella. The importance of qualitative elements in subjective evaluation of semantic gestures. In 2021 16th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2021), pages 1–8. IEEE,
[72]
Mingyang Sun, Mengchen Zhao, Yaqing Hou, Minglei Li, Huang Xu, Songcen Xu, and Jianye Hao. Co-speech gesture synthesis by reinforcement learning with contrastive pre-trained rewards. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2331–2340, 2023. 3
[73]
Kenta Takeuchi, Dai Hasegawa, Shinichi Shirakawa, Naoshi Kaneko, Hiroshi Sakuta, and Kazuhiko Sumi. Speech-to-gesture generation: A challenge in deep learning approach with bi-directional LSTM. In Proceedings of the International Conference on Human Agent Interaction, 2017. 2
[74]
Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Herve Jegou. Training data-efficient image transformers & distillation through attention. In Proceedings of the International Conference on Machine Learning, pages 10347–10357. PMLR, 2021. 22
[75]
Jonathan Tseng, Rodrigo Castellon, and Karen Liu. EDGE: Editable dance generation from music. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 448–458, 2023. 2
[76]
Hendric Voß and Stefan Kopp. Aq-gt: a temporally aligned and quantized gru-transformer for co-speech gesture synthesis. In Proceedings of the 25th International Conference on Multimodal Interaction, pages 60–69, 2023. 2
[77]
Pieter Wolfert, Jeffrey M. Girard, Taras Kucherenko, and Tony Belpaeme. To rate or not to rate: Investigating evaluation methods for generated co-speech gestures. In Proc. ICMI, pages 494–502. ACM, 2021. 6
[78]
Pieter Wolfert, Nicole Robinson, and Tony Belpaeme. A review of evaluation practices of gesture generation in embodied conversational agents. IEEE Transactions on Human-Machine Systems, 52(3):379–389, 2022. 2
[79]
Karren D Yang, Anurag Ranjan, Jen-Hao Rick Chang, Raviteja Vemulapalli, and Oncel Tuzel. Probabilistic speech-driven 3d facial motion synthesis: new benchmarks methods and applications. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 27294–27303, 2024. 2
[80]
Quanwei Yang, Luying Huang, Kaisiyuan Wang, Jiazhi Guan, Shengyi He, Fengguo Li, Lingyun Yu, Yingying Li, Haocheng Feng, Hang Zhou, and Hongtao Xie. Gesturehydra: Semantic co-speech gesture synthesis via hybrid modality diffusion transformer and cascaded-synchronized retrieval-augmented generation. In Proceedings of the IEEE/CVF International Conferen...