Recognizing Co-Speech Gestures in-the-Wild

Andrew Zisserman; K R Prajwal; Sindhu B Hegde

arxiv: 2605.31589 · v1 · pith:PAFMHWQSnew · submitted 2026-05-29 · 💻 cs.CV

Sindhu B Hegde, K R Prajwal, Andrew Zisserman

Pith reviewed 2026-06-28 23:12 UTC · model grok-4.3

classification 💻 cs.CV

keywords co-speech gestures · gesture recognition · video dataset · multimodal models · semantic gestures · temporal localization · GRW dataset

The pith

The GRW dataset is the first large-scale benchmark for recognizing semantic co-speech gestures with word mappings and precise timing.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces the Gesture Recognition in the Wild (GRW) dataset to address the lack of annotated data for semantic co-speech gestures. It contains 156,688 video clips annotated with frame-accurate gesture boundaries across a 150-word vocabulary. This allows training models to classify whether a gesture is semantic, identify the word it corresponds to, and locate it in time. A sympathetic reader would care because it removes a key bottleneck that keeps multimodal models from understanding natural human communication involving gestures.

Core claim

The central contribution is the GRW dataset: 156,688 manually annotated video clips that map unconstrained human gestures to a 150-word taxonomy of physical actions, spatial descriptors, and abstract concepts, with frame-accurate temporal boundaries. The dataset enables benchmarks for three tasks: semantic gesture classification, word recognition, and temporal localization.

What carries the argument

The GRW dataset, which provides manually annotated video clips linking gestures to specific words with precise start and end frames.

If this is right

Video models can be trained to classify whether a gesture is semantic or not.
Models can recognize the specific word corresponding to a co-speech gesture.
Models can temporally localize the gesture within the video.
Benchmarks are established for these three tasks using the dataset.
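The three tasks above can be read as three prediction targets drawn from one annotation record. The sketch below is a minimal illustration under assumed names: the `ClipAnnotation` class, its fields, and the `task_targets` helper are hypothetical and are not the dataset's published schema.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class ClipAnnotation:
    """Hypothetical record for one annotated clip (not the official GRW schema)."""
    clip_id: str
    is_semantic: bool                   # task (a): semantic gesture present or not
    word: Optional[str] = None          # task (b): one of the 150 vocabulary words
    start_frame: Optional[int] = None   # task (c): first frame of the gesture
    end_frame: Optional[int] = None     # task (c): last frame of the gesture (inclusive)

    def __post_init__(self) -> None:
        # Semantic clips must carry a word and a well-ordered frame span.
        if self.is_semantic:
            if self.word is None or self.start_frame is None or self.end_frame is None:
                raise ValueError("semantic clips need a word and both boundaries")
            if self.start_frame > self.end_frame:
                raise ValueError("start_frame must not exceed end_frame")


def task_targets(ann: ClipAnnotation) -> Tuple[int, Optional[str], Optional[Tuple[int, int]]]:
    """Return (binary label, word label, frame span) for the three tasks."""
    if not ann.is_semantic:
        return 0, None, None
    return 1, ann.word, (ann.start_frame, ann.end_frame)
```

Non-semantic clips contribute only to the binary task in this sketch; whether the real benchmark treats them that way is an assumption.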

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This dataset could improve the performance of virtual assistants or video analysis tools in understanding contextual gestures.
It opens the possibility for integrating gesture recognition into real-time speech translation systems.
Future work might extend the taxonomy to include more words or different languages.

Load-bearing premise

The manual annotations are sufficiently accurate, consistent, and representative of real-world co-speech gestures across diverse videos and the 150-word taxonomy.

What would settle it

A study showing that different annotators disagree significantly on the gesture boundaries or word mappings, or that models trained on GRW fail to generalize to new videos not in the dataset.

Figures

Figures reproduced from arXiv: 2605.31589 by Andrew Zisserman, K R Prajwal, Sindhu B Hegde.

**Figure 1.** Semantic vs. non-semantic co-speech gestures. (Top) Obama performs an iconic semantic gesture for the word “massive”. Notice the significant temporal offset between the physical gesture and the spoken word. (Bottom) There are no semantic gestures around the word “beginning”. This paper focuses on building a dataset and models that can automatically recognize and localize semantic gestures in real-world cli…

**Figure 2.** Samples from the GRW dataset. (a) Lexical and kinematic diversity of semantic gestures. (b) Even when gesturing the same conceptual word (“SPIRAL”), speakers employ different physical motions. The dataset captures these gestures across a wide variety of speakers, poses, and camera angles. The GRW dataset comprises 156,688 manually annotated video clips curated from diverse, in-the-wild speaker environment…

**Figure 3.** Semantic Taxonomy of the GRW Dataset. We organize our 150-word vocabulary into a three-tiered hierarchy, radiating from five high-level semantic domains to specific target words (leaf nodes). Node size is strictly proportional to the frequency of annotated clips per word.

**Figure 4.** Proportion of gestured instances among all clips submitted for annotation. The likelihood that a word is accompanied by a semantically meaningful gesture varies dramatically across the lexicon. Some words are gestured very frequently (for example, ‘bye’ is gestured in approximately 66% of cases), while others are rarely depicted. At the extreme low end, ‘look’ is gestured in less than 1% of cases. This …

**Figure 5.** Gesture-speech temporal alignment. Semantic gestures rarely align perfectly with spoken words. As shown in (a), gestures are significantly longer in duration than speech. (b) The vast majority of gestures start before (96.6%) and end after (89.7%) the target word is spoken. (c) Activation heatmaps (aggregated across all samples for a specific word) visually confirm this “envelope” effect.

**Figure 6.** (a) A binary semantic gesture classifier for a query clip. The model uses cross-attention between a target query clip and a broader temporal motion context. (b) The word recognition and localization model is trained in two stages: first, pre-trained on auto-generated pseudo-labels with weak boundary supervision, and then fine-tuned on manually annotated clean data for precise temporal localization.

**Figure 7.** We visualize sample predictions for joint word recognition and localization. In both instances, our model correctly assigns the target word (together, bye) and closely matches the ground-truth temporal bounds. Notably, the right example illustrates our model’s ability to accurately localize anticipatory gestures that occur well before the corresponding speech is uttered.

**Figure 8.** Data samples from the GRW dataset. The dataset encompasses a diverse vocabulary of 150 conceptual words, ranging from numbers (‘two’) and spatial descriptors (‘little’, ‘back’) to physical actions (‘grasp’, ‘stack’) and conversational anchors (‘hello’). For each positive instance, we provide precise, frame-accurate temporal boundaries marking the start and end of the semantic gesture (highlighted by the …

**Figure 9.** Stage-1: Semantic Gesture Annotation Interface.

**Figure 10.** Stage-2: Temporal Boundary Annotation Interface.

**Figure 11.** Analysis of typical failure cases. The word recognition model occasionally misclassifies gestures into semantically and visually synonymous categories. (Top) The model predicts “raise” instead of the ground-truth “peak” as the speaker moves his hand upward. (Bottom) A classic finger-pinching motion results in a prediction of “tiny” rather than the ground-truth “little”. These examples highlight the inhere…
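The alignment statistics quoted in the Figure 5 caption (the share of gestures that start before and end after the spoken word) can be computed from paired time intervals. The following is a minimal sketch, assuming each sample supplies a gesture span and a word span in seconds; the function name `envelope_stats` and the toy intervals are illustrative and are not taken from the paper.

```python
from typing import Dict, List, Tuple

Span = Tuple[float, float]  # (start, end) in seconds


def envelope_stats(pairs: List[Tuple[Span, Span]]) -> Dict[str, float]:
    """Fractions of gestures that start before, end after, and outlast their word.

    Each pair is (gesture_span, word_span).
    """
    if not pairs:
        raise ValueError("need at least one (gesture, word) pair")
    n = len(pairs)
    starts_before = sum(g[0] < w[0] for g, w in pairs)
    ends_after = sum(g[1] > w[1] for g, w in pairs)
    longer = sum((g[1] - g[0]) > (w[1] - w[0]) for g, w in pairs)
    return {
        "starts_before": starts_before / n,
        "ends_after": ends_after / n,
        "gesture_longer": longer / n,
    }


# Toy data, made up for illustration: three gestures and their target words.
toy = [
    ((0.5, 2.0), (1.0, 1.4)),  # gesture envelopes the word
    ((0.8, 1.3), (1.0, 1.6)),  # starts before the word, ends before it finishes
    ((1.2, 2.5), (1.0, 1.6)),  # starts after the word starts, ends after it
]
stats = envelope_stats(toy)
```

On the toy data, two of the three gestures start before the word and two end after it, so both fractions are 2/3. The real figures depend on how word boundaries are obtained, which this sketch does not model.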

Original abstract

While humans naturally gesture during speech, only a sparse subset of these movements are visually depictive and semantically linked to specific spoken words. Current multimodal models struggle to capture these semantic co-speech gestures, heavily bottlenecked by a lack of precisely annotated training data. To address this, we introduce the Gesture Recognition in the Wild (GRW) dataset, the first large-scale benchmark designed to map unconstrained human gestures to specific words with frame-accurate temporal boundaries. Comprising 156,688 manually annotated video clips, GRW spans a highly diverse 150-word taxonomy of physical actions, spatial descriptors, and abstract concepts. We leverage GRW to train video models to (a) classify gestures as semantic or not, (b) recognize the word corresponding to a co-speech gesture, and (c) temporally localize the gesture. We also use GRW to establish benchmarks for these three tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

GRW gives scale and task definitions for co-speech gesture work but supplies no evidence on annotation quality or consistency.

The main takeaway is that this paper releases the GRW dataset of 156k clips with a 150-word taxonomy and frame-accurate boundaries, plus three defined tasks for semantic classification, word recognition, and temporal localization. It targets the gap where multimodal models lack training data for gestures that actually carry word-level meaning.

The contribution is straightforward: it collects a larger and more granular set of examples than prior gesture datasets and spells out concrete benchmarks. That setup could be useful for anyone training video models on human communication.

The soft spot is the complete absence of any numbers on how the annotations were produced or checked. The abstract says the clips are manually annotated but gives no protocol, no inter-annotator agreement, no boundary precision stats, and no baseline model results. Without those, the claim that the data supports reliable benchmarks rests on an untested assumption. If the full paper adds those checks and some performance numbers, the work becomes more solid; right now the central claim is hard to evaluate.

This is for computer vision and multimodal researchers who need gesture data. A reader building models that should understand co-speech gestures might pull the taxonomy or tasks if the annotations prove consistent.

I would send it to peer review. Dataset papers can be worth referee time when the scale is this large, even if they need extra validation sections.

Referee Report

2 major / 0 minor

Summary. The manuscript claims that the GRW dataset is the first large-scale benchmark designed to map unconstrained human gestures to specific words with frame-accurate temporal boundaries. It comprises 156,688 manually annotated video clips spanning a highly diverse 150-word taxonomy of physical actions, spatial descriptors, and abstract concepts, and is used to define and benchmark three tasks: classifying gestures as semantic or not, recognizing the corresponding word, and temporally localizing the gesture.

Significance. If the annotations are shown to be accurate and consistent, the GRW dataset would address a clear data gap for semantic co-speech gesture understanding in multimodal models, providing scale and diversity that could support reproducible benchmarks across the three tasks.

major comments (2)

[Abstract] The central claim that GRW supplies 'frame-accurate temporal boundaries' for the three tasks rests on the quality of the manual annotations, yet the abstract supplies no annotation protocol, number of annotators per clip, adjudication process, or quantitative agreement statistics (e.g., temporal IoU or label kappa).
[Abstract] Without reported inter-annotator agreement or boundary-precision metrics, the weakest assumption, that the 156,688 clips are sufficiently accurate and representative, remains unverified, rendering the benchmark claims for semantic classification, word recognition, and localization unevaluable.
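
The two agreement statistics the report names, temporal IoU and label kappa, are standard and simple to compute. The sketch below assumes two annotators who each label the same clips; the function names and example values are illustrative, and nothing here reflects the paper's actual annotation protocol.

```python
from collections import Counter
from typing import Sequence, Tuple


def temporal_iou(a: Tuple[int, int], b: Tuple[int, int]) -> float:
    """Intersection-over-union of two inclusive frame spans (start, end)."""
    inter = min(a[1], b[1]) - max(a[0], b[0]) + 1
    if inter <= 0:
        return 0.0  # spans do not overlap
    union = (a[1] - a[0] + 1) + (b[1] - b[0] + 1) - inter
    return inter / union


def cohens_kappa(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("label sequences must be non-empty and equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)
```

For example, the spans (10, 19) and (15, 24) share 5 of the 15 frames they jointly cover, giving a temporal IoU of 1/3. Kappa corrects raw agreement for chance, which matters when one label (such as "no gesture") dominates.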

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed comments on the abstract. We agree that the abstract should more explicitly support the annotation quality underlying the 'frame-accurate' claim and will revise it accordingly.

Referee: [Abstract] The central claim that GRW supplies 'frame-accurate temporal boundaries' for the three tasks rests on the quality of the manual annotations, yet the abstract supplies no annotation protocol, number of annotators per clip, adjudication process, or quantitative agreement statistics (e.g., temporal IoU or label kappa).

Authors: We agree that the abstract would be stronger if it briefly referenced the annotation protocol. The full manuscript (Section 4) describes the multi-annotator process and adjudication. In revision we will add one concise sentence to the abstract summarizing the protocol, annotator count, and agreement statistics. Revision: yes.

Referee: [Abstract] Without reported inter-annotator agreement or boundary-precision metrics, the weakest assumption, that the 156,688 clips are sufficiently accurate and representative, remains unverified, rendering the benchmark claims for semantic classification, word recognition, and localization unevaluable.

Authors: We acknowledge the point: the current abstract does not report these metrics, making the claims harder to evaluate at a glance. The manuscript body provides annotation details; we will revise the abstract to include a short statement on inter-annotator agreement and boundary precision so the benchmark claims are more directly verifiable. Revision: yes.

Circularity Check

0 steps flagged

No circularity: empirical dataset paper with no derivations or fitted predictions.

The paper introduces the GRW dataset of 156,688 manually annotated clips and defines three downstream tasks (semantic classification, word recognition, temporal localization). No equations, parameters, or closed-form claims appear anywhere in the provided text. The contribution is a new benchmark definition rather than a derivation that reduces to its own inputs. No self-citations, ansatzes, or uniqueness theorems are invoked to support any mathematical result. Annotation quality concerns (protocol, agreement metrics) affect verifiability but do not constitute circularity under the defined patterns, as there is no reduction of a claimed prediction to a fitted input or self-referential definition.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The contribution rests on the unverified premise that manual annotation yields reliable frame-accurate labels; no free parameters, invented entities, or additional axioms are stated in the abstract.

axioms (1)

Domain assumption: manual annotations provide frame-accurate temporal boundaries for gestures.
The utility of GRW for training and benchmarking the three tasks depends directly on this premise.

[1] [1]

arXiv preprint arXiv:2506.22554 (2025)

Agrawal, V., Akinyemi, A., Alvero, K., Behrooz, M., Buffalini, J., Carlucci, F.M., Chen, J., Chen, J., Chen, Z., Cheng, S., et al.: Seamless interaction: Dyadic audio- visual motion modeling and large-scale dataset. arXiv preprint arXiv:2506.22554 (2025)

work page arXiv 2025

[2] [2]

In: Findings of the association for computational linguistics: EMNLP 2020

Ahuja, C., Lee, D.W., Ishii, R., Morency, L.P.: No gestures left behind: Learning relationships between spoken language and freeform gestures. In: Findings of the association for computational linguistics: EMNLP 2020. pp. 1884–1895 (2020)

2020

[3] [3]

arXiv (2021)

Albanie, S., Varol, G., Momeni, L., Bull, H., Afouras, T., Chowdhury, H., Fox, N., Woll, B., Cooper, R., McParland, A., Zisserman, A.: BOBSL: BBC-Oxford British Sign Language Dataset. arXiv (2021)

2021

[4] [4]

Alexanderson, S., Henter, G.E., Kucherenko, T., Beskow, J.: Style-controllable speech-driven gesture synthesis using normalising flows39(2), 487–496 (2020)

2020

[5] [5]

ACM Transac- tions on Graphics (TOG)41(6), 1–19 (2022)

Ao, T., Gao, Q., Lou, Y., Chen, B., Liu, L.: Rhythmic gesticulator: Rhythm-aware co-speech gesture synthesis with hierarchical neural embeddings. ACM Transac- tions on Graphics (TOG)41(6), 1–19 (2022)

2022

[6] [6]

Personality and social psychology bulletin21(4), 394–405 (1995)

Bavelas, J.B., Chovil, N., Coates, L., Roe, L.: Gestures specialized for dialogue. Personality and social psychology bulletin21(4), 394–405 (1995)

1995

[7] [7]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Duarte, A., Palaskar, S., Ventura, L., Ghadiyaram, D., DeHaan, K., Metze, F., Torres, J., Giro-i Nieto, X.: How2sign: a large-scale multimodal dataset for con- tinuous american sign language. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 2735–2744 (2021)

2021

[8] [8]

In: Proceedings of the 27th ACM International Conference on Multimedia

Dutta, A., Zisserman, A.: The VIA annotation software for images, audio and video. In: Proceedings of the 27th ACM International Conference on Multimedia. MM ’19, ACM, New York, NY, USA (2019).https://doi.org/10.1145/3343031. 3350535,https://doi.org/10.1145/3343031.3350535

work page doi:10.1145/3343031 2019

[9] [9]

ACM Transactions on Graphics (TOG)37(4), 1–11 (2018)

Ephrat, A., Mosseri, I., Lang, O., Dekel, T., Wilson, K., Hassidim, A., Free- man, W.T., Rubinstein, M.: Looking to listen at the cocktail party: a speaker- independent audio-visual model for speech separation. ACM Transactions on Graphics (TOG)37(4), 1–11 (2018)

2018

[10] [10]

In: Proceedings of the 18th international conference on intelligent virtual agents

Ferstl, Y., McDonnell, R.: Investigating the use of recurrent motion modelling for speech gesture generation. In: Proceedings of the 18th international conference on intelligent virtual agents. pp. 93–98 (2018)

2018

[11] [11]

Computer Animation and Virtual Worlds 32(3-4), e2016 (2021) 16 S Hegde et al

Ferstl, Y., Neff, M., McDonnell, R.: Expressgesture: Expressive gesture generation from speech through database matching. Computer Animation and Virtual Worlds 32(3-4), e2016 (2021) 16 S Hegde et al

2021

[12] [12]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Ginosar, S., Bar, A., Kohavi, G., Chan, C., Owens, A., Malik, J.: Learning individ- ual styles of conversational gesture. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 3497–3506 (2019)

2019

[13] [13]

Google: Gemini 3.https://blog.google/products-and-platforms/products/ gemini/gemini-3/(2026), accessed: 2026-03-06

2026

[14] [14]

In: Proceedings of the 63rd Annual Meeting of the Association for Computa- tional Linguistics (Volume 1: Long Papers) (2025)

Gueuwou, S., Du, X., Shakhnarovich, G., Livescu, K., Liu, A.H.: Shubert: Self- supervised sign language representation learning via multi-stream cluster predic- tion. In: Proceedings of the 63rd Annual Meeting of the Association for Computa- tional Linguistics (Volume 1: Long Papers) (2025)

2025

[15] Hegde, S., Zisserman, A.: Gestsync: Determining who is speaking without a talking head. In: BMVC (2023)

[16] Hegde, S.B., Prajwal, K., Kwon, T., Zisserman, A.: Understanding co-speech gestures in-the-wild. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 9977–9987 (2025)

[17] Huberman, S., Goldberg, K., Patashnik, O., Benaim, S., Mokady, R.: Semanticmoments: Training-free motion similarity via third moment features. arXiv preprint arXiv:2602.09146 (2026)

[18] Köpüklü, O., Gunduz, A., Kose, N., Rigoll, G.: Real-time hand gesture detection and classification using convolutional neural networks. In: 2019 14th IEEE international conference on automatic face & gesture recognition (FG 2019). IEEE (2019)

[19] Lee, G., Deng, Z., Ma, S., Shiratori, T., Srinivasa, S.S., Sheikh, Y.: Talking with hands 16.2M: A large-scale dataset of synchronized body-finger motion and audio for conversational motion analysis and synthesis. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 763–772 (2019)

[20] Li, D., Rodriguez, C., Yu, X., Li, H.: Word-level deep sign language recognition from video: A new large-scale dataset and methods comparison. In: Proceedings of the IEEE/CVF winter conference on applications of computer vision. pp. 1459–1469 (2020)

[21] Lin, B., Ye, Y., Zhu, B., Cui, J., Ning, M., Jin, P., Yuan, L.: Video-llava: Learning united visual representation by alignment before projection. In: Proceedings of the 2024 conference on empirical methods in natural language processing. pp. 5971–5984 (2024)

[22] Liu, H., Zhu, Z., Iwamoto, N., Peng, Y., Li, Z., Zhou, Y., Bozkurt, E., Zheng, B.: Beat: A large-scale semantic and emotional multi-modal dataset for conversational gestures synthesis. In: European conference on computer vision. pp. 612–630. Springer (2022)

[23] Lugaresi, C., Tang, J., Nash, H., McClanahan, C., Uboweja, E., Hays, M., Zhang, F., Chang, C.L., Yong, M.G., Lee, J., et al.: Mediapipe: A framework for building perception pipelines. arXiv preprint arXiv:1906.08172 (2019)

[24] Luo, H., Ji, L., Zhong, M., Chen, Y., Lei, W., Duan, N., Li, T.: Clip4clip: An empirical study of clip for end to end video clip retrieval and captioning. Neurocomputing 508, 293–304 (2022)

[25] Materzynska, J., Berger, G., Bax, I., Memisevic, R.: The jester dataset: A large-scale video dataset of human gestures. In: Proceedings of the IEEE/CVF international conference on computer vision workshops (2019)

[26] Molchanov, P., Yang, X., Gupta, S., Kim, K., Tyree, S., Kautz, J.: Online detection and classification of dynamic hand gestures with recurrent 3d convolutional neural network. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 4207–4215 (2016)

[27] Mughal, M.H., Dabral, R., Habibie, I., Donatelli, L., Habermann, M., Theobalt, C.: Convofusion: Multi-modal conversational diffusion for co-speech gesture synthesis. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 1388–1398 (2024)

[28] Nyatsanga, S., Kucherenko, T., Ahuja, C., Henter, G.E., Neff, M.: A comprehensive review of data-driven co-speech gesture generation. 42(2), 569–596 (2023)

[29] Prajwal, K., Hegde, S., Zisserman, A.: Scaling multilingual visual speech recognition. In: ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). pp. 1–5. IEEE (2025)

[30] Quiros, L.C., Tax, D.M.J., Hung, H.: Gestures in-the-wild: Detecting conversational hand gestures in crowded scenes using a multimodal fusion of bags of video trajectories and body worn acceleration. IEEE Trans. Multim. 22(1), 138–147 (2020). https://doi.org/10.1109/TMM.2019.2922122

[31] Su, J., Lu, Y., Pan, S., Wen, B., Liu, Y.: Roformer: Enhanced transformer with rotary position embedding. CoRR abs/2104.09864 (2021), https://arxiv.org/abs/2104.09864

[32] Uthus, D., Tanzer, G., Georg, M.: Youtube-asl: A large-scale, open-domain american sign language-english parallel corpus. Advances in Neural Information Processing Systems pp. 29029–29047 (2023)

[33] Wan, J., Zhao, Y., Zhou, S., Guyon, I., Escalera, S., Li, S.Z.: Chalearn looking at people rgb-d isolated and continuous datasets for gesture recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition workshops. pp. 56–64 (2016)

[34] Wang, C.Y., Yeh, I.H., Mark Liao, H.Y.: Yolov9: Learning what you want to learn using programmable gradient information. In: European conference on computer vision. pp. 1–21. Springer (2024)

[35] Wang, Y., Li, K., Li, Y., He, Y., Huang, B., Zhao, Z., Yin, H., Chen, J., Jin, T., Wu, J., et al.: Internvideo2: Scaling video foundation models for multimodal video understanding. In: European Conference on Computer Vision. Springer (2024)

[36] Zhang, Y., Cao, C., Cheng, J., Lu, H.: Egogesture: A new dataset and benchmark for egocentric hand gesture recognition. IEEE Transactions on Multimedia pp. 1038–1050 (2018)

[37] Zhu, B., Lin, B., Ning, M., Yan, Y., Cui, J., Wang, H., Pang, Y., Jiang, W., Zhang, J., Li, Z., et al.: Languagebind: Extending video-language pretraining to n-modality by language-based semantic alignment. arXiv preprint arXiv:2310.01852 (2023)

A Dataset

A.1 Data samples

Fig 8 illustrates representative samples from the semantic subset of our man...

[38] No semantic gestures present in the video.

[39] Semantic gesture may be present, but is not clear and/or is not fully visible.

[40] Semantic gesture present, and is clearly visible.

– Make a decision: “Yes” (gesture found) or “No” (gesture not found). The decision will be “Yes” only if category 3 occurs as explained above.
– Give the confidence score between 0 and 1, 0 being the least confident and 1 being the most confident.
– Provide a short reasoning.

Output Format: Gesture present: <...

[41] Analyze all the frames of the video together and produce one final gesture description that best represents what the hands are gesturing throughout these frames.

[42] Based on the description, rank a given set of 100 possible word classes with confidence scores (between 0 and 1) such that they sum to approximately 1.0.

[43] [43]

raising hand then pointing forward

Predict the start and end time of the gesture in seconds (gesture boundary) for the recognized target word. Classes: bye, below, push, move, direction, specific, together, whole, hello, straight, above, switch, five, turn, expand, throw, open, huge, raise, large, tiny, long, mix, circle, no, two, three, entire, top, four, lower, full, press, small, grab, g...
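
As a rough illustration of the constraints stated in the prompt above (per-class confidence scores between 0 and 1 that sum to approximately 1.0, and a gesture boundary given as start and end times in seconds), the following Python sketch checks one response. The function names, the tolerance value, and the example scores are hypothetical and are not taken from the paper; the sketch assumes the model output has already been parsed into a dictionary.

```python
import math


def ranked_words(scores):
    """Return the word classes sorted from most to least confident."""
    return sorted(scores, key=scores.get, reverse=True)


def check_prediction(scores, start_s, end_s, tol=0.05):
    """List the constraint violations found in one parsed response.

    scores  : dict mapping a word class to its confidence (hypothetical format)
    start_s : predicted gesture start time in seconds
    end_s   : predicted gesture end time in seconds
    tol     : assumed tolerance for "sum to approximately 1.0"
    """
    problems = []
    if any(not 0.0 <= s <= 1.0 for s in scores.values()):
        problems.append("score outside [0, 1]")
    if not math.isclose(sum(scores.values()), 1.0, abs_tol=tol):
        problems.append("scores do not sum to approximately 1.0")
    if not 0.0 <= start_s < end_s:
        problems.append("invalid gesture boundary")
    return problems


# Illustrative values only; a real response would cover the full class list.
example = {"push": 0.6, "move": 0.3, "throw": 0.1}
print(ranked_words(example))
print(check_prediction(example, start_s=1.2, end_s=2.0))
```

A check of this kind is only a sanity filter on the response format; it says nothing about whether the top-ranked word or the predicted boundary is correct.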