Towards Reliable Human Evaluations in Gesture Generation: Insights from a Community-Driven State-of-the-Art Benchmark


arXiv: 2511.01233 · v4 · submitted 2025-11-03 · cs.CV · cs.GR · cs.HC


Rajmund Nagy (1), Hendric Voss (2), Thanh Hoang-Minh (3), Mihail Tsakov (4), Teodor Nikolov (5), Zeyi Zhang (6), Tenglong Ao (6), Sicheng Yang (7), Shaoli Huang (8), Yongkang Cheng (8), M. Hamza Mughal (9), Rishabh Dabral (9), Kiran Chhatre (1), Christian Theobalt (9), Libin Liu (6), Stefan Kopp (2), Rachel McDonnell (10), Michael Neff (11), Taras Kucherenko (12), Youngwoo Yoon (13), Gustav Eje Henter (1, 5)

(1) KTH Royal Institute of Technology; (2) Bielefeld University; (3) University of Science -- VNUHCM; (4) Independent Researcher; (5) Motorica AB; (6) Peking University; (7) Huawei Technologies Ltd.; (8) Astribot; (9) Max-Planck Institute for Informatics, SIC; (10) Trinity College Dublin; (11) University of California, Davis; (12) SEED -- Electronic Arts; (13) Electronics and Telecommunications Research Institute (ETRI)


Pith reviewed 2026-05-18 01:36 UTC · model grok-4.3

classification cs.CV · cs.GR · cs.HC

keywords gesture generation · human evaluation · speech-driven gestures · BEAT2 dataset · motion realism · speech-gesture alignment · benchmarking · crowdsourced evaluation


The pith

Standardized human evaluations show motion realism has saturated for gesture generation models on the BEAT2 dataset while speech alignment claims fail to hold.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines how human evaluations of speech-driven 3D gesture generators have lacked consistent methods, making it hard to compare approaches or identify real progress. To fix this, the authors create a detailed protocol for the popular BEAT2 motion-capture dataset and use it to run large crowdsourced tests on six recent models trained by their original teams. The tests separate two qualities: how natural the movements look on their own, and how well they match the spoken words. Results indicate that realism scores no longer distinguish newer models from older ones, and alignment scores are lower than earlier studies suggested even for models built specifically for that goal. The work releases rendered videos and human votes so others can test without rebuilding models, and it argues that future benchmarking needs these two qualities measured apart.

Core claim

Applying the new protocol across six author-trained models on BEAT2 reveals that motion realism has become saturated, with older models matching recent ones, while prior reports of strong speech-gesture alignment do not survive rigorous pairwise testing; therefore accurate progress requires separate measurement of motion quality and multimodal alignment rather than combined scores.

What carries the argument

The crowdsourced human evaluation protocol that disentangles motion realism from speech-gesture alignment through large-scale pairwise preference votes on rendered video stimuli from the BEAT2 dataset.
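
The step from raw pairwise votes to a ranking can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' analysis code: it assumes each vote is a (winner, loser) pair with no ties, fits Bradley-Terry strengths with the classic iterative update, and maps them to an Elo-like scale. The function names, the 1000-point baseline, and the toy votes are hypothetical choices made for this example.

```python
import math
from collections import defaultdict


def fit_bradley_terry(votes, n_iter=200):
    """Fit Bradley-Terry strengths from (winner, loser) pairs.

    Assumes no ties, that every condition wins at least once, and that the
    comparison graph is connected. Strengths are scaled so that their
    geometric mean is 1.
    """
    wins = defaultdict(float)         # total wins per condition
    pair_counts = defaultdict(float)  # number of votes per unordered pair
    conditions = set()
    for winner, loser in votes:
        if winner == loser:
            raise ValueError("a condition cannot be compared with itself")
        wins[winner] += 1.0
        pair_counts[frozenset((winner, loser))] += 1.0
        conditions.update((winner, loser))

    conditions = sorted(conditions)
    if any(wins[c] == 0 for c in conditions):
        raise ValueError("every condition needs at least one win")

    strength = {c: 1.0 for c in conditions}
    for _ in range(n_iter):
        updated = {}
        for i in conditions:
            # Update rule: p_i <- W_i / sum_j n_ij / (p_i + p_j)
            denom = 0.0
            for j in conditions:
                n_ij = pair_counts.get(frozenset((i, j)), 0.0) if i != j else 0.0
                if n_ij:
                    denom += n_ij / (strength[i] + strength[j])
            updated[i] = wins[i] / denom
        # Fix the arbitrary scale: geometric mean of strengths equals 1.
        log_mean = sum(math.log(v) for v in updated.values()) / len(updated)
        strength = {c: v / math.exp(log_mean) for c, v in updated.items()}
    return strength


def to_elo(strength, baseline=1000.0):
    """Map strengths to an Elo-like scale (400 points per factor of 10)."""
    return {c: baseline + 400.0 * math.log10(s) for c, s in strength.items()}


# Toy data: condition "A" preferred over "B" in 3 of 4 votes.
toy_votes = [("A", "B")] * 3 + [("B", "A")]
toy_elo = to_elo(fit_bradley_terry(toy_votes))
```

With three votes for A and one for B, the fitted strength ratio is 3, which corresponds to a gap of 400·log10(3), roughly 191 points on this scale. Confidence intervals such as those reported with the paper's ratings would need an extra step (for example, refitting on bootstrap resamples of the votes), which this sketch omits.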

If this is right

Motion realism can no longer serve as a useful benchmark on BEAT2 because older and newer models perform on par.
Claims of high speech-gesture alignment from earlier work do not replicate under controlled conditions even for models designed for alignment.
Benchmarking must separate motion quality from multimodal alignment to avoid misleading combined scores.
The released five hours of synthetic motion and 750+ video stimuli enable new studies without requiring model reimplementation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Future model development could shift focus toward improving alignment while preserving the already high realism baseline.
The released preference votes and rendering script create a reusable testbed that other multimodal generation fields might adapt for their own evaluation standards.
If the saturation finding generalizes, research resources may move away from pure realism metrics toward timing, semantics, or style control in gestures.

Load-bearing premise

The crowdsourced protocol itself introduces no new biases from the participant pool or platform that would distort the rankings of realism and alignment.

What would settle it

A replication of the same pairwise votes using a different crowdsourcing platform or screened participant group that produces substantially different model rankings or restores high alignment scores for specialized models.
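
One way to quantify whether a replication yields "substantially different model rankings" is a rank correlation between the scores from the two studies. The sketch below computes Kendall's tau (the tau-a variant, which makes no correction for ties) in plain Python. The score dictionaries and model labels are hypothetical, and the threshold for "substantially different" is a judgment the code does not make.

```python
from itertools import combinations


def kendall_tau(scores_a, scores_b):
    """Kendall's tau-a between two score dicts keyed by the same conditions.

    Returns a value in [-1, 1]: 1 means identical ordering, -1 means the
    ordering is fully reversed. Tied pairs count as neither concordant
    nor discordant.
    """
    if set(scores_a) != set(scores_b):
        raise ValueError("both studies must rate the same conditions")
    keys = sorted(scores_a)
    if len(keys) < 2:
        raise ValueError("need at least two conditions")

    concordant = discordant = 0
    for x, y in combinations(keys, 2):
        # Same sign of the score difference in both studies: concordant pair.
        product = (scores_a[x] - scores_a[y]) * (scores_b[x] - scores_b[y])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1

    n_pairs = len(keys) * (len(keys) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical ratings from an original study and a replication.
original = {"model_1": 1040.0, "model_2": 1010.0, "model_3": 950.0}
replication = {"model_1": 1025.0, "model_2": 960.0, "model_3": 1015.0}
tau = kendall_tau(original, replication)
```

In the toy data, two of the three model pairs keep their order and one flips, so tau is 1/3. Because tau ignores the size of the score gaps, a flip between two models with overlapping confidence intervals weighs as much as a flip between clearly separated ones, which is worth keeping in mind when the rankings are close to saturated.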

Figures

Figures reproduced from arXiv: 2511.01233 by Rajmund Nagy, Hendric Voss, Thanh Hoang-Minh, Mihail Tsakov, Teodor Nikolov, Zeyi Zhang, Tenglong Ao, Sicheng Yang, Shaoli Huang, Yongkang Cheng, M. Hamza Mughal, Rishabh Dabral, Kiran Chhatre, Christian Theobalt, Libin Liu, Stefan Kopp, Rachel McDonnell, Michael Neff, Taras Kucherenko, Youngwoo Yoon, and Gustav Eje Henter.

**Figure 3.** Example embodiments used in recent evaluations […]

**Figure 4.** Results of the motion-realism user study, in the form of Elo ratings for each condition considered and 95% confidence intervals

**Figure 5.** Results of the speech-gesture appropriateness user study.

**Figure 6.** Questions and response options in the two types of user studies, also showing their schematic layout in the user-study GUI. […]

**Figure 7.** A screenshot of the GUI for the user studies, specifically from a motion-realism test with the current screen containing stimulus […]

**Figure 8.** Frequency of JUICE options chosen for each model […]

**Figure 10.** A video frame showing a gesturing SMPL-X avatar

Original abstract

We review human evaluation practices in automatic, speech-driven 3D gesture generation and find a lack of standardisation and frequent use of flawed experimental setups. This leads to a situation where it is impossible to know how different methods compare, or what the state of the art is. In order to address common shortcomings of evaluation design, and to standardise future user studies in gesture-generation works, we introduce a detailed human evaluation protocol for the widely-used BEAT2 motion-capture dataset. Using this protocol, we conduct large-scale crowdsourced evaluation to rank six recent gesture-generation models -- each trained by its original authors -- across two key evaluation dimensions: motion realism and speech-gesture alignment. Our results show that 1) motion realism has become a saturated evaluation measure on the BEAT2 dataset, with older models performing on par with more recent approaches; 2) previous findings of high speech-gesture alignment do not hold up under rigorous evaluation, even for specialised models; and 3) the field must adopt disentangled assessments of motion quality and multimodal alignment for accurate benchmarking in order to make progress. To drive standardisation and enable new evaluation research, we release five hours of synthetic motion from the benchmarked models; over 750 rendered video stimuli from the user studies -- enabling new evaluations without requiring model reimplementation -- alongside our open-source rendering script, and 16,000 pairwise human preference votes collected for our benchmark.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

This paper gives the gesture generation field a standardized protocol plus a big public release of videos and votes, but its claims of saturated realism and non-replicating alignment scores depend on the crowdsourced setup being free of hidden biases.


The main point is that motion realism on BEAT2 now looks saturated under their human study, with older models scoring on par with newer ones, while earlier reports of strong speech-gesture alignment fail to hold up. They argue the field needs separate measures for motion quality and multimodal fit going forward, and they back this with new data rather than re-analysis of old numbers.

Referee Report

2 major / 3 minor

Summary. The manuscript reviews human evaluation practices in automatic speech-driven 3D gesture generation, identifying a lack of standardization and frequent use of flawed experimental setups that prevent reliable comparisons or identification of the state of the art. To address these issues, the authors introduce a detailed human evaluation protocol for the BEAT2 motion-capture dataset. They apply this protocol in a large-scale crowdsourced study ranking six recent gesture-generation models (each trained by its original authors) on two dimensions: motion realism and speech-gesture alignment. Results indicate that motion realism has become saturated on BEAT2 (older models perform on par with recent ones), that prior findings of high speech-gesture alignment do not replicate under rigorous evaluation, and that the field should adopt disentangled assessments of motion quality and multimodal alignment. The authors release five hours of synthetic motion, over 750 rendered video stimuli, an open-source rendering script, and 16,000 pairwise human preference votes to support standardization and future research.

Significance. If the protocol proves robust, the work could meaningfully advance benchmarking standards in gesture generation by providing a reproducible protocol and releasing extensive resources (model outputs, video stimuli, rendering code, and a large set of human votes). These releases are a clear strength for reproducibility and enable new evaluations without model reimplementation. The findings on saturation and alignment replication could prompt the community to move beyond saturated or confounded metrics, though this depends on addressing the protocol's documentation.

major comments (2)

[§4 (Evaluation Protocol)] The manuscript provides limited detail on participant filtering, attention checks, demographic controls, and any calibration against expert or lab-based raters. Because the headline claims of motion realism saturation on BEAT2 and non-replication of prior alignment results rest on the crowdsourced protocol producing unbiased rankings, these aspects require fuller specification to substantiate the conclusions.
[§5 (Results)] The evidence for saturation (older models performing on par with recent ones) and the alignment findings should include explicit statistical tests, p-values, effect sizes, or confidence intervals in the relevant tables or figures. Without these, it is difficult to assess whether the observed parity or differences are statistically meaningful or merely due to variance in the crowdsourced data.

minor comments (3)

Figure captions should explicitly describe what error bars represent (e.g., standard error or 95% CI) and clarify the exact comparison being shown in each panel.
Notation for the two evaluation dimensions (motion realism vs. speech-gesture alignment) should be used consistently throughout the text and figures to avoid ambiguity.
[Related Work] The related-work section would benefit from citing any very recent (post-2023) gesture-generation papers that also discuss evaluation practices.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment below and have revised the manuscript to incorporate additional details and analyses where feasible, thereby strengthening the transparency and statistical rigor of our work.


Referee: [§4 (Evaluation Protocol)] The manuscript provides limited detail on participant filtering, attention checks, demographic controls, and any calibration against expert or lab-based raters. Because the headline claims of motion realism saturation on BEAT2 and non-replication of prior alignment results rest on the crowdsourced protocol producing unbiased rankings, these aspects require fuller specification to substantiate the conclusions.

Authors: We appreciate the referee's emphasis on rigorous documentation of the crowdsourcing procedure. Section 4 of the manuscript already describes the use of attention checks, basic participant filtering based on response quality, and collection of demographic information as part of the protocol for the BEAT2 dataset. To address this comment directly, we will expand the section with more granular specifications of the filtering thresholds, attention check design, and demographic breakdowns. Regarding calibration against expert or lab-based raters, the study did not include a direct comparison; we followed common practices for large-scale crowdsourced evaluations in generative modeling. In the revision we will add an explicit discussion of this design choice, its alignment with prior literature, and any associated limitations.

Revision: partial

Referee: [§5 (Results)] The evidence for saturation (older models performing on par with recent ones) and the alignment findings should include explicit statistical tests, p-values, effect sizes, or confidence intervals in the relevant tables or figures. Without these, it is difficult to assess whether the observed parity or differences are statistically meaningful or merely due to variance in the crowdsourced data.

Authors: We agree that explicit statistical support is important for interpreting the saturation and alignment results. The manuscript currently presents preference rankings and percentages from the 16,000 votes. In the revised version we will augment the relevant tables and figures in Section 5 with appropriate non-parametric statistical tests (e.g., Friedman test followed by post-hoc Wilcoxon signed-rank tests with correction), p-values, effect sizes (rank-biserial correlation), and 95% confidence intervals for the key comparisons. These additions will be computed from the existing preference data and will clarify the statistical meaningfulness of the observed model parity in motion realism and the alignment findings.

Revision: yes
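
To make the request for statistical support concrete, here is a minimal per-pair check. It is deliberately simpler than the tests named in the rebuttal and is not taken from the paper: for one pair of conditions it computes an exact two-sided binomial test of "no preference" and a Wilson 95% interval for the win rate. The function names and vote counts are hypothetical, ties are assumed to be excluded beforehand, and any correction for testing many pairs is left out.

```python
import math


def _log_binom_pmf(k, n, p0):
    """Log-probability of k successes in n trials, computed in log space."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p0) + (n - k) * math.log(1.0 - p0))


def binomial_two_sided_p(wins, n, p0=0.5):
    """Exact two-sided p-value for observing `wins` out of `n` votes.

    Sums the probabilities of all outcomes that are no more likely than
    the observed one under the null win rate p0.
    """
    if n <= 0 or not 0 <= wins <= n:
        raise ValueError("need n > 0 and 0 <= wins <= n")
    if not 0.0 < p0 < 1.0:
        raise ValueError("p0 must lie strictly between 0 and 1")
    log_observed = _log_binom_pmf(wins, n, p0)
    total = 0.0
    for k in range(n + 1):
        log_pk = _log_binom_pmf(k, n, p0)
        if log_pk <= log_observed + 1e-9:  # small tolerance for rounding
            total += math.exp(log_pk)
    return min(1.0, total)


def wilson_interval(wins, n, z=1.96):
    """Wilson score interval for a win rate (z=1.96 gives about 95%)."""
    if n <= 0 or not 0 <= wins <= n:
        raise ValueError("need n > 0 and 0 <= wins <= n")
    p_hat = wins / n
    denom = 1.0 + z * z / n
    centre = (p_hat + z * z / (2.0 * n)) / denom
    half = z * math.sqrt(p_hat * (1.0 - p_hat) / n + z * z / (4.0 * n * n)) / denom
    return centre - half, centre + half


# Hypothetical pair: the newer model is preferred in 52 of 100 votes.
p_value = binomial_two_sided_p(52, 100)
low, high = wilson_interval(52, 100)
```

A claim of parity between two models is better supported by a narrow interval around 50% than by a non-significant p-value alone, since a small number of votes per pair would also fail to reject "no preference".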

Circularity Check

0 steps flagged

New crowdsourced human preference data yields independent benchmark results


The paper reviews prior evaluation practices, introduces a detailed protocol for the BEAT2 dataset, and reports results from a fresh large-scale crowdsourced study collecting 16,000 pairwise votes on motion realism and speech-gesture alignment for six models. The headline claims (saturation of realism, failure of prior alignment findings to replicate, need for disentangled metrics) are direct empirical outcomes of this new data collection and protocol application, not reductions of fitted parameters, self-definitions, or self-citation chains. The work is self-contained against external benchmarks via released stimuli and votes; no load-bearing step equates to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The paper's claims rest on the validity of their new evaluation protocol and the representativeness of the crowdsourced data.

axioms (1)

Domain assumption: Human judgments in crowdsourced pairwise comparisons reliably reflect perceived motion realism and speech-gesture alignment.
This underpins the entire evaluation protocol and results.



Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

ViBES: A Conversational Agent with Behaviorally-Intelligent 3D Virtual Body
cs.CV · 2025-12 · unverdicted · novelty 7.0

ViBES introduces a speech-language-behavior model using modality-specific transformer experts that jointly generates dialogue and 3D body actions, showing gains over separate co-speech and text-to-motion baselines on ...

Reality Check: How Avatar and Face Representation Affect the Perceptual Evaluation of Synthesized Gestures
cs.GR · 2026-05 · unverdicted · novelty 6.0

Avatar and face representation systematically shift perceptual judgments of synthesized co-speech gestures.

Reality Check: How Avatar and Face Representation Affect the Perceptual Evaluation of Synthesized Gestures
cs.GR · 2026-05 · unverdicted · novelty 5.0

Avatar appearance and facial presentation systematically bias perceptual judgments of synthesized co-speech gestures.

Reference graph

Works this paper leans on

93 extracted references · 93 canonical work pages · cited by 2 Pith papers

[1]

Generative AI for character animation: A comprehensive survey of tech- niques, applications, and future directions.arXiv preprint arXiv:2504.19056, 2025

Mohammad Mahdi Abootorabi, Omid Ghahroodi, Par- dis Sadat Zahraei, Hossein Behzadasl, Alireza Mirrokni, Mobina Salimipanah, Arash Rasouli, Bahar Behzadipour, Sara Azarnoush, Benyamin Maleki, et al. Generative AI for character animation: A comprehensive survey of tech- niques, applications, and future directions.arXiv preprint arXiv:2504.19056, 2025. 1

work page arXiv 2025
[2]

No gestures left behind: Learning rela- tionships between spoken language and freeform gestures

Chaitanya Ahuja, Dong Won Lee, Ryo Ishii, and Louis- Philippe Morency. No gestures left behind: Learning rela- tionships between spoken language and freeform gestures. InProceedings of the 2020 Conference on Empirical Meth- ods in Natural Language Processing: Findings, pages 1884– 1895, 2020. 3

work page 2020
[3]

Continual learning for personalized co-speech gesture generation

Chaitanya Ahuja, Pratik Joshi, Ryo Ishii, and Louis-Philippe Morency. Continual learning for personalized co-speech gesture generation. InProceedings of the IEEE/CVF In- ternational Conference on Computer Vision (ICCV), pages 20893–20903, 2023. 3

work page 2023
[4]

Style-controllable speech-driven gesture synthesis using normalising flows

Simon Alexanderson, Gustav Eje Henter, Taras Kucherenko, and Jonas Beskow. Style-controllable speech-driven gesture synthesis using normalising flows. InComputer Graphics Forum, pages 487–496. Wiley Online Library, 2020. 2

work page 2020
[5]

Listen, denoise, action! audio-driven motion synthesis with diffusion models.ACM Transactions on Graphics (TOG), 42(4):1–20, 2023

Simon Alexanderson, Rajmund Nagy, Jonas Beskow, and Gustav Eje Henter. Listen, denoise, action! audio-driven motion synthesis with diffusion models.ACM Transactions on Graphics (TOG), 42(4):1–20, 2023. 2, 3, 5, 22

work page 2023
[6]

GestureDiffu- CLIP: Gesture diffusion model with CLIP latents.ACM Transactions on Graphics (TOG), 42(4):1–18, 2023

Tenglong Ao, Zeyi Zhang, and Libin Liu. GestureDiffu- CLIP: Gesture diffusion model with CLIP latents.ACM Transactions on Graphics (TOG), 42(4):1–18, 2023. 2, 3, 5

work page 2023
[7]

Kenneth J. Arrow. A difficulty in the concept of social wel- fare.Journal of Political Economy, 58(4):328–346, 1950. 9

work page 1950
[8]

Why spiderman is such a good dancer.https : / / web

Jody Avirgan. Why spiderman is such a good dancer.https : / / web . archive . org / web / 20201112011116/https://www.wnycstudios. org / podcasts / radiolab / articles / 299399 - why-spiderman-such-good-dancer, 2013. 9

work page 2013
[9]

wav2vec 2.0: A framework for self-supervised learning of speech representations.Advances in neural infor- mation processing systems, 33:12449–12460, 2020

Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli. wav2vec 2.0: A framework for self-supervised learning of speech representations.Advances in neural infor- mation processing systems, 33:12449–12460, 2020. 21

work page 2020
[10]

On the adaptive con- trol of the false discovery rate in multiple testing with inde- 10 pendent statistics.J

Yoav Benjamini and Yosef Hochberg. On the adaptive con- trol of the false discovery rate in multiple testing with inde- 10 pendent statistics.J. Educ. Behav. Stat., 25(1):60–83, 2000. 17

work page 2000
[11]

Elo uncovered: Robustness and best practices in language model evaluation.Advances in Neural Information Processing Systems, 37:106135–106161, 2024

Meriem Boubdir, Edward Kim, Beyza Ermis, Sara Hooker, and Marzieh Fadaee. Elo uncovered: Robustness and best practices in language model evaluation.Advances in Neural Information Processing Systems, 37:106135–106161, 2024. 9

work page 2024
[12]

Ralph Allan Bradley and Milton E. Terry. Rank analysis of incomplete block designs: I. the method of paired compar- isons.Biometrika, 39(3/4):324–345, 1952. 6, 16

work page 1952
[13]

Towards better user studies in com- puter graphics and vision.Foundations and Trends® in Com- puter Graphics and Vision, 15(3):201–252, 2023

Zoya Bylinskii, Laura Herman, Aaron Hertzmann, Stefanie Hutka, Yile Zhang, et al. Towards better user studies in com- puter graphics and vision.Foundations and Trends® in Com- puter Graphics and Vision, 15(3):201–252, 2023. 1

work page 2023
[14]

A V-Flow: Transforming text to audio-visual human-like interactions.arXiv preprint arXiv:2502.13133,

Aggelina Chatziagapi, Louis-Philippe Morency, Hongyu Gong, Michael Zollh ¨ofer, Dimitris Samaras, and Alexan- der Richard. A V-Flow: Transforming text to audio-visual human-like interactions.arXiv preprint arXiv:2502.13133,

work page arXiv
[15]

Motion-example-controlled co-speech ges- ture generation leveraging large language models

Bohong Chen, Yumeng Li, Youyi Zheng, Yao-Xiang Ding, and Kun Zhou. Motion-example-controlled co-speech ges- ture generation leveraging large language models. InPro- ceedings of the Special Interest Group on Computer Graph- ics and Interactive Techniques Conference Conference Pa- pers, New York, NY , USA, 2025. Association for Computing Machinery. 3

work page 2025
[16]

The language of motion: Unifying verbal and non-verbal language of 3d human motion

Changan Chen, Juze Zhang, Shrinidhi Kowshika Laksh- mikanth, Yusu Fang, Ruizhi Shao, Gordon Wetzstein, Li Fei- Fei, and Ehsan Adeli. The language of motion: Unifying verbal and non-verbal language of 3d human motion. InPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025. 3

work page 2025
[17]

Diffsheg: A diffusion-based approach for real-time speech-driven holistic 3d expression and ges- ture generation

Junming Chen, Yunfei Liu, Jianan Wang, Ailing Zeng, Yu Li, and Qifeng Chen. Diffsheg: A diffusion-based approach for real-time speech-driven holistic 3d expression and ges- ture generation. InProceedings of the IEEE/CVF Confer- ence on Computer Vision and Pattern Recognition, 2024. 3, 5

work page 2024
[18]

Hop: Heterogeneous topology-based mul- timodal entanglement for co-speech gesture generation

Hongye Cheng, Tianyu Wang, Guangsi Shi, Zexing Zhao, and Yanwei Fu. Hop: Heterogeneous topology-based mul- timodal entanglement for co-speech gesture generation. In Proceedings of the IEEE/CVF Conference on Computer Vi- sion and Pattern Recognition, 2025. 3, 5

work page 2025
[19]

Siggesture: Gen- eralized co-speech gesture synthesis via semantic injection with large-scale pre-training diffusion models

Qingrong Cheng, Xu Li, and Xinghui Fu. Siggesture: Gen- eralized co-speech gesture synthesis via semantic injection with large-scale pre-training diffusion models. InSIG- GRAPH Asia 2024 Conference Papers, New York, NY , USA,

work page 2024
[20]

Association for Computing Machinery. 3

work page
[21]

HoloGest: Decoupled diffusion and motion priors for generating holisticly expres- sive co-speech gestures

Yongkang Cheng and Shaoli Huang. HoloGest: Decoupled diffusion and motion priors for generating holisticly expres- sive co-speech gestures. InProceedings of the International Conference on 3D Vision, 2025. 7, 8, 22

work page 2025
[22]

Black, and Timo Bolkart

Kiran Chhatre, Radek Dan ˇeˇcek, Nikos Athanasiou, Giorgio Becherini, Christopher Peters, Michael J. Black, and Timo Bolkart. AMUSE: Emotional speech-driven 3D body ani- mation via disentangled latent diffusion. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1942–1953, 2024. 2, 3, 5, 7, 20, 22

work page 1942
[23]

Gonzalez, and Ion Stoica

Wei-Lin Chiang, Tim Li, Joseph E. Gonzalez, and Ion Stoica. Chatbot Arena: New models & Elo system up- date.https://lmsys.org/blog/2023- 12- 07- leaderboard/, 2023. Accessed: 2025-05-20. 16

work page 2023
[24]

Effectively unbiased FID and Inception score and where to find them

Min Jin Chong and David Forsyth. Effectively unbiased FID and Inception score and where to find them. InProceedings of the IEEE/CVF Conference on Computer Vision and Pat- tern Recognition, pages 6070–6079, 2020. 23

work page 2020
[25]

Investigating range- equalizing bias in mean opinion score ratings of synthesized speech

Erica Cooper and Junichi Yamagishi. Investigating range- equalizing bias in mean opinion score ratings of synthesized speech. InProc. Interspeech, pages 1104–1108, 2023. 4

work page 2023
[26]

Advancing objective evaluation of speech-driven gesture generation for embodied conversational agents.International Journal of Human–Computer Interaction, 0(0):1–17, 2025

Karlo Crnek, Grega Mo ˇcnik, and Matej Rojc. Advancing objective evaluation of speech-driven gesture generation for embodied conversational agents.International Journal of Human–Computer Interaction, 0(0):1–17, 2025. 2, 10

work page 2025
[27]

Mofusion: A framework for denoising-diffusion-based motion synthesis

Rishabh Dabral, Muhammad Hamza Mughal, Vladislav Golyanik, and Christian Theobalt. Mofusion: A framework for denoising-diffusion-based motion synthesis. InProceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023. 2

work page 2023
[28]

Diffusion-based co-speech gesture genera- tion using joint text and audio representation

Anna Deichler, Shivam Mehta, Simon Alexanderson, and Jonas Beskow. Diffusion-based co-speech gesture genera- tion using joint text and audio representation. InProceedings of the International Conference on Multimodal Interaction, pages 755–762, 2023. 2

work page 2023
[29]

Arpad E. Elo. The proposed USCF rating system, its devel- opment, theory, and applications.Chess Life, 22(8):242–247,

work page
[30]

See- ing is believing: body motion dominates in multisensory conversations.ACM Transactions on Graphics (TOG), 29 (4):1–9, 2010

Cathy Ennis, Rachel McDonnell, and Carol O’Sullivan. See- ing is believing: body motion dominates in multisensory conversations.ACM Transactions on Graphics (TOG), 29 (4):1–9, 2010. 4

work page 2010
[31]

Investigating the use of recurrent motion modelling for speech gesture generation

Ylva Ferstl and Rachel McDonnell. Investigating the use of recurrent motion modelling for speech gesture generation. In Proceedings of the ACM International Conference on Intel- ligent Virtual Agents, pages 93–98, 2018. 2, 3

work page 2018
[32]

Zeroeggs: Zero-shot example-based gesture generation from speech

Saeed Ghorbani, Ylva Ferstl, Daniel Holden, Nikolaus F Troje, and Marc-Andr ´e Carbonneau. Zeroeggs: Zero-shot example-based gesture generation from speech. InCom- puter Graphics Forum, pages 206–216. Wiley Online Li- brary, 2023. 2, 3, 21

work page 2023
[33]

Factorizing text-to-video generation by explicit image conditioning

Rohit Girdhar, Mannat Singh, Andrew Brown, Quentin Du- val, Samaneh Azadi, Sai Saketh Rambhatla, Akbar Shah, Xi Yin, Devi Parikh, and Ishan Misra. Factorizing text-to-video generation by explicit image conditioning. InProceedings of the European Conference on Computer Vision, pages 205– 224, 2024. 6

work page 2024
[34]

wild west

Kazi Injamamul Haque, Alkiviadis Pavlou, and Zerrin Yu- mak. “wild west” of evaluating speech-driven 3d facial ani- mation synthesis: A benchmark study. InComputer Graph- ics Forum, page e70073. Wiley Online Library, 2025. 2

work page 2025
[35]

Evaluation of speech-to-gesture generation using bi-directional LSTM network

Dai Hasegawa, Naoshi Kaneko, Shinichi Shirakawa, Hiroshi Sakuta, and Kazuhiko Sumi. Evaluation of speech-to-gesture generation using bi-directional LSTM network. InProceed- ings of the ACM International Conference on Intelligent Vir- tual Agents, pages 79–86, New York, NY , USA, 2018. ACM. 2 11

work page 2018
[36]

Automatic quality assessment of speech-driven synthesized gestures.International Journal of Computer Games Technology, 2022, 2022

Zhiyuan He. Automatic quality assessment of speech-driven synthesized gestures.International Journal of Computer Games Technology, 2022, 2022. 10

work page 2022
[37]

The curse of performative user studies

Aaron Hertzmann. The curse of performative user studies. IEEE Computer Graphics and Applications, 43(6):112–116,

work page
[38]

Establishing a uni- fied evaluation framework for human motion generation: A comparative analysis of metrics.Computer Vision and Image Understanding, 254:104337, 2025

Ali Ismail-Fawaz, Maxime Devanne, Stefano Berretti, Jonathan Weber, and Germain Forestier. Establishing a uni- fied evaluation framework for human motion generation: A comparative analysis of metrics.Computer Vision and Image Understanding, 254:104337, 2025. 10

work page 2025
[39]

Let’s face it: Probabilistic multi-modal interlocutor-aware generation of facial gestures in dyadic set- tings

Patrik Jonell, Taras Kucherenko, Gustav Eje Henter, and Jonas Beskow. Let’s face it: Probabilistic multi-modal interlocutor-aware generation of facial gestures in dyadic set- tings. InProceedings of the ACM International Conference on Intelligent Virtual Agents, 2020. 4, 7

work page 2020
[40]

Maurice George Kendall.Rank correlation methods.Griffin,

work page
[41]

Analyzing input and output representations for speech-driven gesture gener- ation

Taras Kucherenko, Dai Hasegawa, Gustav Eje Henter, Naoshi Kaneko, and Hedvig Kjellstr ¨om. Analyzing input and output representations for speech-driven gesture gener- ation. InProceedings of the ACM International Conference on Intelligent Virtual Agents, pages 97–104, New York, NY , USA, 2019. ACM. 2

work page 2019
[42]

Taras Kucherenko, Patrik Jonell, Sanne Van Waveren, Gustav Eje Henter, Simon Alexandersson, Iolanda Leite, and Hedvig Kjellström. Gesticulator: A framework for semantically-aware speech-driven gesture generation. In Proceedings of the ACM International Conference on Multimodal Interaction, pages 242–250, 2020.
[43]

Taras Kucherenko, Dai Hasegawa, Naoshi Kaneko, Gustav Eje Henter, and Hedvig Kjellström. Moving fast and slow: Analysis of representations and post-processing in speech-driven automatic gesture generation. International Journal of Human-Computer Interaction, 37(14):1300–1316, 2021.
[44]

Taras Kucherenko, Patrik Jonell, Youngwoo Yoon, Pieter Wolfert, and Gustav Eje Henter. A large, crowdsourced evaluation of gesture generation systems on common data: The GENEA Challenge 2020. In 26th International Conference on Intelligent User Interfaces, pages 11–21, 2021.
[45]

Taras Kucherenko, Rajmund Nagy, Youngwoo Yoon, Jieyeon Woo, Teodor Nikolov, Mihail Tsakov, and Gustav Eje Henter. The GENEA Challenge 2023: A large-scale evaluation of gesture generation models in monadic and dyadic settings. In Proceedings of the International Conference on Multimodal Interaction, pages 792–801, 2023.
[46]

Taras Kucherenko, Pieter Wolfert, Youngwoo Yoon, Carla Viegas, Teodor Nikolov, Mihail Tsakov, and Gustav Eje Henter. Evaluating gesture generation in a large-scale open challenge: The GENEA Challenge 2022. ACM Transactions on Graphics (TOG), 2024.
[47]

Gilwoo Lee, Zhiwei Deng, Shugao Ma, Takaaki Shiratori, Siddhartha S. Srinivasa, and Yaser Sheikh. Talking With Hands 16.2M: A large-scale dataset of synchronized body-finger motion and audio for conversational motion analysis and synthesis. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 763–772,
[48]

Ruilong Li, Shan Yang, David A. Ross, and Angjoo Kanazawa. AI choreographer: Music conditioned 3D dance generation with AIST++. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13401–13412, 2021.
[49]

Haiyang Liu, Zihao Zhu, Naoya Iwamoto, Yichen Peng, Zhengqing Li, You Zhou, Elif Bozkurt, and Bo Zheng. BEAT: A large-scale semantic and emotional multi-modal dataset for conversational gestures synthesis. In Proceedings of the European Conference on Computer Vision, pages 612–630, 2022.
[50]

Haiyang Liu, Zihao Zhu, Giorgio Becherini, Yichen Peng, Mingyang Su, You Zhou, Xuefei Zhe, Naoya Iwamoto, Bo Zheng, and Michael J. Black. EMAGE: Towards unified holistic co-speech gesture generation via expressive masked audio gesture modeling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1144–1154, 2024.
[51]

Lanmiao Liu, Esam Ghaleb, Aslı Özyürek, and Zerrin Yumak. Semges: Semantics-aware co-speech gesture generation using semantic coherence and relevance learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025.
[52]

Pinxin Liu, Luchuan Song, Junhua Huang, and Chenliang Xu. Gesturelsm: Latent shortcut based co-speech gesture generation with spatial-temporal modeling. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025.
[53]

Xian Liu, Qianyi Wu, Hang Zhou, Yinghao Xu, Rui Qian, Xinyi Lin, Xiaowei Zhou, Wayne Wu, Bo Dai, and Bolei Zhou. Learning hierarchical cross-modal association for co-speech gesture generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10462–10472, 2022.
[54]

Yu Liu, Gelareh Mohammadi, Yang Song, and Wafa Johal. Speech-based gesture generation for robots and embodied agents: A scoping review. In Proceedings of the International Conference on Human-Agent Interaction, pages 31–38, 2021.
[55]

Yifei Liu, Qiong Cao, Yandong Wen, Huaiguang Jiang, and Changxing Ding. Towards variable and coordinated holistic co-speech motion generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1566–1576, 2024.
[56]

Rachel McDonnell, Martin Breidt, and Heinrich H Bülthoff. Render me real? investigating the effect of render style on the perception of animated virtual humans. ACM Transactions on Graphics (TOG), 31(4):1–11, 2012.
[57]

Jared E. Miller, Laura A. Carlson, and J. Devin McAuley. When what you hear influences when you see: listening to an auditory rhythm influences the temporal allocation of visual attention. Psychological Science, 24(1):11–18, 2013.
[58]

Muhammad Hamza Mughal, Rishabh Dabral, Ikhsanul Habibie, Lucia Donatelli, Marc Habermann, and Christian Theobalt. Convofusion: Multi-modal conversational diffusion for co-speech gesture synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024.
[59]

M. Hamza Mughal, Rishabh Dabral, Merel C. J. Scholman, Vera Demberg, and Christian Theobalt. Retrieving semantics from the deep: an rag solution for gesture synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025.
[60]

Rajmund Nagy, Hendric Voss, Youngwoo Yoon, Taras Kucherenko, Teodor Nikolov, Thanh Hoang-Minh, Rachel McDonnell, Stefan Kopp, Michael Neff, and Gustav Eje Henter. Towards a genea leaderboard–an extended, living benchmark for evaluating and advancing conversational motion synthesis. arXiv preprint arXiv:2410.06327, 2024.
[61]

Evonne Ng, Javier Romero, Timur Bagautdinov, Shaojie Bai, Trevor Darrell, Angjoo Kanazawa, and Alexander Richard. From audio to photoreal embodiment: Synthesizing humans in conversations. In IEEE Conference on Computer Vision and Pattern Recognition, 2024.
[62]

Simbarashe Nyatsanga, Taras Kucherenko, Chaitanya Ahuja, Gustav Eje Henter, and Michael Neff. A comprehensive review of data-driven co-speech gesture generation. In Computer Graphics Forum, pages 569–596. Wiley Online Library, 2023.
[63]

Kunkun Pang, Dafei Qin, Yingruo Fan, Julian Habekost, Takaaki Shiratori, Junichi Yamagishi, and Taku Komura. Bodyformer: Semantics-guided 3d body gesture synthesis with transformer. ACM Transactions on Graphics (TOG), 42(4):1–12, 2023.
[64]

Georgios Pavlakos, Vasileios Choutas, Nima Ghorbani, Timo Bolkart, Ahmed AA Osman, Dimitrios Tzionas, and Michael J Black. Expressive body capture: 3d hands, face, and body from a single image. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10975–10985, 2019.
[65]

Georgios Pavlakos, Vasileios Choutas, Nima Ghorbani, Timo Bolkart, Ahmed A. A. Osman, Dimitrios Tzionas, and Michael J. Black. Expressive body capture: 3D hands, face, and body from a single image. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10975–10985, 2019.
[66]

Olivier Perrotin, Brooke Stephenson, Silvain Gerber, and Gérard Bailly. The blizzard challenge 2023. In 18th Blizzard Challenge Workshop, pages 1–27. ISCA, 2023.
[67]

Wim Pouw, Shannon Proksch, Linda Drijvers, Marco Gamba, Judith Holler, Christopher Kello, Rebecca S. Schaefer, and Geraint A. Wiggins. Multilevel rhythms in multimodal communication. P. Roy. Soc. B, 376(1835), 2021.
[68]

Xingqun Qi, Jiahao Pan, Peng Li, Ruibin Yuan, Xiaowei Chi, Mengfei Li, Wenhan Luo, Wei Xue, Shanghang Zhang, Qifeng Liu, and Yike Guo. Weakly-supervised emotion transition learning for diverse 3d co-speech gesture generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10424–10434, 2024.
[69]

Manuel Rebol, Christian Gütl, and Krzysztof Pietroszek. Passing a non-verbal Turing test: Evaluating gesture animations generated from speech. In Proceedings of the IEEE Conference on Virtual Reality and 3D User Interfaces, pages 573–581. IEEE, 2021.
[70]

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022.
[71]

Carolyn Saund and Stacy Marsella. The importance of qualitative elements in subjective evaluation of semantic gestures. In 2021 16th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2021), pages 1–8. IEEE,
[72]

Mingyang Sun, Mengchen Zhao, Yaqing Hou, Minglei Li, Huang Xu, Songcen Xu, and Jianye Hao. Co-speech gesture synthesis by reinforcement learning with contrastive pre-trained rewards. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2331–2340, 2023.
[73]

Kenta Takeuchi, Dai Hasegawa, Shinichi Shirakawa, Naoshi Kaneko, Hiroshi Sakuta, and Kazuhiko Sumi. Speech-to-gesture generation: A challenge in deep learning approach with bi-directional LSTM. In Proceedings of the International Conference on Human Agent Interaction, 2017.
[74]

Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Herve Jegou. Training data-efficient image transformers & distillation through attention. In Proceedings of the International Conference on Machine Learning, pages 10347–10357. PMLR, 2021.
[75]

Jonathan Tseng, Rodrigo Castellon, and Karen Liu. EDGE: Editable dance generation from music. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 448–458, 2023.
[76]

Hendric Voß and Stefan Kopp. Aq-gt: a temporally aligned and quantized gru-transformer for co-speech gesture synthesis. In Proceedings of the 25th International Conference on Multimodal Interaction, pages 60–69, 2023.
[77]

Pieter Wolfert, Jeffrey M. Girard, Taras Kucherenko, and Tony Belpaeme. To rate or not to rate: Investigating evaluation methods for generated co-speech gestures. In Proc. ICMI, pages 494–502. ACM, 2021.
[78]

Pieter Wolfert, Nicole Robinson, and Tony Belpaeme. A review of evaluation practices of gesture generation in embodied conversational agents. IEEE Transactions on Human-Machine Systems, 52(3):379–389, 2022.
[79]

Karren D Yang, Anurag Ranjan, Jen-Hao Rick Chang, Raviteja Vemulapalli, and Oncel Tuzel. Probabilistic speech-driven 3d facial motion synthesis: new benchmarks methods and applications. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 27294–27303, 2024.
[80]

Quanwei Yang, Luying Huang, Kaisiyuan Wang, Jiazhi Guan, Shengyi He, Fengguo Li, Lingyun Yu, Yingying Li, Haocheng Feng, Hang Zhou, and Hongtao Xie. Gesturehydra: Semantic co-speech gesture synthesis via hybrid modality diffusion transformer and cascaded-synchronized retrieval-augmented generation. In Proceedings of the IEEE/CVF International Conferen...

[36] [36]

Automatic quality assessment of speech-driven synthesized gestures.International Journal of Computer Games Technology, 2022, 2022

Zhiyuan He. Automatic quality assessment of speech-driven synthesized gestures.International Journal of Computer Games Technology, 2022, 2022. 10

work page 2022

[37] [37]

The curse of performative user studies

Aaron Hertzmann. The curse of performative user studies. IEEE Computer Graphics and Applications, 43(6):112–116,

work page

[38] [38]

Establishing a uni- fied evaluation framework for human motion generation: A comparative analysis of metrics.Computer Vision and Image Understanding, 254:104337, 2025

Ali Ismail-Fawaz, Maxime Devanne, Stefano Berretti, Jonathan Weber, and Germain Forestier. Establishing a uni- fied evaluation framework for human motion generation: A comparative analysis of metrics.Computer Vision and Image Understanding, 254:104337, 2025. 10

work page 2025

[39] [39]

Let’s face it: Probabilistic multi-modal interlocutor-aware generation of facial gestures in dyadic set- tings

Patrik Jonell, Taras Kucherenko, Gustav Eje Henter, and Jonas Beskow. Let’s face it: Probabilistic multi-modal interlocutor-aware generation of facial gestures in dyadic set- tings. InProceedings of the ACM International Conference on Intelligent Virtual Agents, 2020. 4, 7

work page 2020

[40] [40]

Maurice George Kendall.Rank correlation methods.Griffin,

work page

[41] [41]

Analyzing input and output representations for speech-driven gesture gener- ation

Taras Kucherenko, Dai Hasegawa, Gustav Eje Henter, Naoshi Kaneko, and Hedvig Kjellstr ¨om. Analyzing input and output representations for speech-driven gesture gener- ation. InProceedings of the ACM International Conference on Intelligent Virtual Agents, pages 97–104, New York, NY , USA, 2019. ACM. 2

work page 2019

[42] [42]

Gesticulator: A framework for semantically-aware speech-driven gesture generation

Taras Kucherenko, Patrik Jonell, Sanne Van Waveren, Gustav Eje Henter, Simon Alexandersson, Iolanda Leite, and Hedvig Kjellstr ¨om. Gesticulator: A framework for semantically-aware speech-driven gesture generation. In Proceedings of the ACM International Conference on Mul- timodal Interaction, pages 242–250, 2020

work page 2020

[43] [43]

Taras Kucherenko, Dai Hasegawa, Naoshi Kaneko, Gus- tav Eje Henter, and Hedvig Kjellstr ¨om. Moving fast and slow: Analysis of representations and post-processing in speech-driven automatic gesture generation.Interna- tional Journal of Human-Computer Interaction, 37(14): 1300–1316, 2021. 2

work page 2021

[44] [44]

A large, crowdsourced eval- uation of gesture generation systems on common data: The genea challenge 2020

Taras Kucherenko, Patrik Jonell, Youngwoo Yoon, Pieter Wolfert, and Gustav Eje Henter. A large, crowdsourced eval- uation of gesture generation systems on common data: The genea challenge 2020. In26th international conference on intelligent user interfaces, pages 11–21, 2021. 2, 4, 15, 20

work page 2020

[45] [45]

The GENEA Challenge 2023: A large- scale evaluation of gesture generation models in monadic and dyadic settings

Taras Kucherenko, Rajmund Nagy, Youngwoo Yoon, Jieyeon Woo, Teodor Nikolov, Mihail Tsakov, and Gus- tav Eje Henter. The GENEA Challenge 2023: A large- scale evaluation of gesture generation models in monadic and dyadic settings. InProceedings of the International Con- ference on Multimodal Interaction, pages 792–801, 2023. 4, 7

work page 2023

[46] [46]

Evaluating gesture generation in a large-scale open chal- lenge: The GENEA Challenge 2022.ACM Transactions on Graphics (TOG), 2024

Taras Kucherenko, Pieter Wolfert, Youngwoo Yoon, Carla Viegas, Teodor Nikolov, Mihail Tsakov, and Gustav Eje Hen- ter. Evaluating gesture generation in a large-scale open chal- lenge: The GENEA Challenge 2022.ACM Transactions on Graphics (TOG), 2024. 2, 4, 7, 15, 19, 23

work page 2022

[47] [47]

Srinivasa, and Yaser Sheikh

Gilwoo Lee, Zhiwei Deng, Shugao Ma, Takaaki Shiratori, Siddhartha S. Srinivasa, and Yaser Sheikh. Talking With Hands 16.2 M: A large-scale dataset of synchronized body- finger motion and audio for conversational motion analy- sis and synthesis. InProceedings of the IEEE/CVF Inter- national Conference on Computer Vision, pages 763–772,

work page

[48] [48]

Ross, and Angjoo Kanazawa

Ruilong Li, Shan Yang, David A. Ross, and Angjoo Kanazawa. AI choreographer: Music conditioned 3D dance generation with AIST++. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13401–13412, 2021. 22

work page 2021

[49] [49]

BEAT: A large-scale semantic and emotional multi-modal dataset for conversational gestures synthesis

Haiyang Liu, Zihao Zhu, Naoya Iwamoto, Yichen Peng, Zhengqing Li, You Zhou, Elif Bozkurt, and Bo Zheng. BEAT: A large-scale semantic and emotional multi-modal dataset for conversational gestures synthesis. InProceedings of the European Conference on Computer Vision, pages 612– 630, 2022. 3, 22

work page 2022

[50] [50]

Haiyang Liu, Zihao Zhu, Giorgio Becherini, Yichen Peng, Mingyang Su, You Zhou, Xuefei Zhe, Naoya Iwamoto, Bo Zheng, and Michael J. Black. EMAGE: Towards unified holistic co-speech gesture generation via expressive masked audio gesture modeling. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1144–1154, 2024. 1, 3...

work page 2024

[51] [51]

Semges: Semantics-aware co-speech gesture gener- ation using semantic coherence and relevance learning

Lanmiao Liu, Esam Ghaleb, Aslı ¨Ozy¨urek, and Zerrin Yu- mak. Semges: Semantics-aware co-speech gesture gener- ation using semantic coherence and relevance learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025. 3

work page 2025

[52] Pinxin Liu, Luchuan Song, Junhua Huang, and Chenliang Xu. Gesturelsm: Latent shortcut based co-speech gesture generation with spatial-temporal modeling. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025. 3

[53] Xian Liu, Qianyi Wu, Hang Zhou, Yinghao Xu, Rui Qian, Xinyi Lin, Xiaowei Zhou, Wayne Wu, Bo Dai, and Bolei Zhou. Learning hierarchical cross-modal association for co-speech gesture generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10462–10472, 2022. 2, 22

[54] Yu Liu, Gelareh Mohammadi, Yang Song, and Wafa Johal. Speech-based gesture generation for robots and embodied agents: A scoping review. In Proceedings of the International Conference on Human-Agent Interaction, pages 31–38, 2021. 1

[55] Yifei Liu, Qiong Cao, Yandong Wen, Huaiguang Jiang, and Changxing Ding. Towards variable and coordinated holistic co-speech motion generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1566–1576, 2024. 3

[56] Rachel McDonnell, Martin Breidt, and Heinrich H. Bülthoff. Render me real? Investigating the effect of render style on the perception of animated virtual humans. ACM Transactions on Graphics (TOG), 31(4):1–11, 2012. 5

[57] Jared E. Miller, Laura A. Carlson, and J. Devin McAuley. When what you hear influences when you see: listening to an auditory rhythm influences the temporal allocation of visual attention. Psychological Science, 24(1):11–18, 2013. 9

[58] Muhammad Hamza Mughal, Rishabh Dabral, Ikhsanul Habibie, Lucia Donatelli, Marc Habermann, and Christian Theobalt. Convofusion: Multi-modal conversational diffusion for co-speech gesture synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024. 2, 3, 7, 8, 21

[59] M. Hamza Mughal, Rishabh Dabral, Merel C. J. Scholman, Vera Demberg, and Christian Theobalt. Retrieving semantics from the deep: an rag solution for gesture synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025. 3, 4, 5, 7, 8, 17, 21

[60] Rajmund Nagy, Hendric Voss, Youngwoo Yoon, Taras Kucherenko, Teodor Nikolov, Thanh Hoang-Minh, Rachel McDonnell, Stefan Kopp, Michael Neff, and Gustav Eje Henter. Towards a genea leaderboard–an extended, living benchmark for evaluating and advancing conversational motion synthesis. arXiv preprint arXiv:2410.06327, 2024. 10

[61] Evonne Ng, Javier Romero, Timur Bagautdinov, Shaojie Bai, Trevor Darrell, Angjoo Kanazawa, and Alexander Richard. From audio to photoreal embodiment: Synthesizing humans in conversations. In IEEE Conference on Computer Vision and Pattern Recognition, 2024. 2, 3, 5, 18, 19, 22

[62] Simbarashe Nyatsanga, Taras Kucherenko, Chaitanya Ahuja, Gustav Eje Henter, and Michael Neff. A comprehensive review of data-driven co-speech gesture generation. In Computer Graphics Forum, pages 569–596. Wiley Online Library, 2023. 1, 2, 17

[63] Kunkun Pang, Dafei Qin, Yingruo Fan, Julian Habekost, Takaaki Shiratori, Junichi Yamagishi, and Taku Komura. Bodyformer: Semantics-guided 3d body gesture synthesis with transformer. ACM Transactions on Graphics (TOG), 42(4):1–12, 2023. 3, 5, 20

[64] Georgios Pavlakos, Vasileios Choutas, Nima Ghorbani, Timo Bolkart, Ahmed A. A. Osman, Dimitrios Tzionas, and Michael J. Black. Expressive body capture: 3D hands, face, and body from a single image. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10975–10985, 2019. 6

[65] Georgios Pavlakos, Vasileios Choutas, Nima Ghorbani, Timo Bolkart, Ahmed A. A. Osman, Dimitrios Tzionas, and Michael J. Black. Expressive body capture: 3D hands, face, and body from a single image. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10975–10985, 2019. 20

[66] Olivier Perrotin, Brooke Stephenson, Silvain Gerber, and Gérard Bailly. The blizzard challenge 2023. In 18th Blizzard Challenge Workshop, pages 1–27. ISCA, 2023. 15

[67] Wim Pouw, Shannon Proksch, Linda Drijvers, Marco Gamba, Judith Holler, Christopher Kello, Rebecca S. Schaefer, and Geraint A. Wiggins. Multilevel rhythms in multimodal communication. P. Roy. Soc. B, 376(1835), 2021. 9

[68] Xingqun Qi, Jiahao Pan, Peng Li, Ruibin Yuan, Xiaowei Chi, Mengfei Li, Wenhan Luo, Wei Xue, Shanghang Zhang, Qifeng Liu, and Yike Guo. Weakly-supervised emotion transition learning for diverse 3d co-speech gesture generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10424–10434, 2024. 3, 5

[69] Manuel Rebol, Christian Gütl, and Krzysztof Pietroszek. Passing a non-verbal Turing test: Evaluating gesture animations generated from speech. In Proceedings of the IEEE Conference on Virtual Reality and 3D User Interfaces, pages 573–581. IEEE, 2021. 4, 7

[70] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022. 22

[71] Carolyn Saund and Stacy Marsella. The importance of qualitative elements in subjective evaluation of semantic gestures. In 2021 16th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2021), pages 1–8. IEEE,

[72] Mingyang Sun, Mengchen Zhao, Yaqing Hou, Minglei Li, Huang Xu, Songcen Xu, and Jianye Hao. Co-speech gesture synthesis by reinforcement learning with contrastive pre-trained rewards. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2331–2340, 2023. 3

[73] Kenta Takeuchi, Dai Hasegawa, Shinichi Shirakawa, Naoshi Kaneko, Hiroshi Sakuta, and Kazuhiko Sumi. Speech-to-gesture generation: A challenge in deep learning approach with bi-directional LSTM. In Proceedings of the International Conference on Human Agent Interaction, 2017. 2

[74] Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Herve Jegou. Training data-efficient image transformers & distillation through attention. In Proceedings of the International Conference on Machine Learning, pages 10347–10357. PMLR, 2021. 22

[75] Jonathan Tseng, Rodrigo Castellon, and Karen Liu. EDGE: Editable dance generation from music. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 448–458, 2023. 2

[76] Hendric Voß and Stefan Kopp. Aq-gt: a temporally aligned and quantized gru-transformer for co-speech gesture synthesis. In Proceedings of the 25th International Conference on Multimodal Interaction, pages 60–69, 2023. 2

[77] Pieter Wolfert, Jeffrey M. Girard, Taras Kucherenko, and Tony Belpaeme. To rate or not to rate: Investigating evaluation methods for generated co-speech gestures. In Proc. ICMI, pages 494–502. ACM, 2021. 6

[78] Pieter Wolfert, Nicole Robinson, and Tony Belpaeme. A review of evaluation practices of gesture generation in embodied conversational agents. IEEE Transactions on Human-Machine Systems, 52(3):379–389, 2022. 2

[79] Karren D Yang, Anurag Ranjan, Jen-Hao Rick Chang, Raviteja Vemulapalli, and Oncel Tuzel. Probabilistic speech-driven 3d facial motion synthesis: new benchmarks methods and applications. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 27294–27303, 2024. 2

[80] Quanwei Yang, Luying Huang, Kaisiyuan Wang, Jiazhi Guan, Shengyi He, Fengguo Li, Lingyun Yu, Yingying Li, Haocheng Feng, Hang Zhou, and Hongtao Xie. Gesturehydra: Semantic co-speech gesture synthesis via hybrid modality diffusion transformer and cascaded-synchronized retrieval-augmented generation. In Proceedings of the IEEE/CVF International Conferen...