
arxiv: 2605.31589 · v1 · pith:PAFMHWQSnew · submitted 2026-05-29 · 💻 cs.CV

Recognizing Co-Speech Gestures in-the-Wild

Pith reviewed 2026-06-28 23:12 UTC · model grok-4.3

classification 💻 cs.CV
keywords co-speech gestures · gesture recognition · video dataset · multimodal models · semantic gestures · temporal localization · GRW dataset

The pith

The GRW dataset is the first large-scale benchmark for recognizing semantic co-speech gestures with word mappings and precise timing.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces the Gesture Recognition in the Wild (GRW) dataset to address the lack of annotated data for semantic co-speech gestures. It contains 156,688 video clips annotated against a 150-word taxonomy, with frame-accurate gesture boundaries. This supports training models to classify whether a gesture is semantic, identify the word it corresponds to, and locate it in time. A sympathetic reader would care because the dataset targets a key bottleneck that keeps multimodal models from understanding natural human communication involving gestures.

Core claim

The central contribution is the GRW dataset: 156,688 manually annotated video clips that map unconstrained human gestures to a 150-word taxonomy of physical actions, spatial descriptors, and abstract concepts, with frame-accurate temporal boundaries. The dataset enables benchmarks for semantic gesture classification, word recognition, and temporal localization.

What carries the argument

The GRW dataset, which provides manually annotated video clips linking gestures to specific words with precise start and end frames.

If this is right

  • Video models can be trained to classify whether a gesture is semantic or not.
  • Models can recognize the specific word corresponding to a co-speech gesture.
  • Models can temporally localize the gesture within the video.
  • Benchmarks are established for these three tasks using the dataset.
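
The three tasks listed above can be read as three training targets derived from a single annotation record. The following is a minimal sketch under that reading; `ClipAnnotation`, its field names, and `task_targets` are hypothetical illustrations, not the dataset's released schema.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class ClipAnnotation:
    """Hypothetical GRW-style record; all field names are assumptions."""
    clip_id: str
    word: str                          # target word spoken in the clip
    is_semantic: bool                  # is the word accompanied by a semantic gesture?
    start_frame: Optional[int] = None  # gesture start (inclusive), semantic clips only
    end_frame: Optional[int] = None    # gesture end (inclusive), semantic clips only


def task_targets(
    ann: ClipAnnotation,
) -> Tuple[bool, Optional[str], Optional[Tuple[int, int]]]:
    """Return (semantic label, word label, temporal span) for one clip."""
    if not ann.is_semantic:
        # Non-semantic clips only supervise the binary classifier.
        return False, None, None
    if ann.start_frame is None or ann.end_frame is None:
        raise ValueError("semantic clips need both boundaries")
    if ann.end_frame < ann.start_frame:
        raise ValueError("end_frame precedes start_frame")
    return True, ann.word, (ann.start_frame, ann.end_frame)
```

In this sketch a non-semantic clip contributes only to the first task, while a semantic clip contributes a label to all three.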

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This dataset could improve the performance of virtual assistants or video analysis tools in understanding contextual gestures.
  • It opens the possibility for integrating gesture recognition into real-time speech translation systems.
  • Future work might extend the taxonomy to include more words or different languages.

Load-bearing premise

The manual annotations are sufficiently accurate, consistent, and representative of real-world co-speech gestures across diverse videos and the 150-word taxonomy.

What would settle it

A study showing that different annotators disagree significantly on the gesture boundaries or word mappings, or that models trained on GRW fail to generalize to new videos not in the dataset.

Figures

Figures reproduced from arXiv: 2605.31589 by Andrew Zisserman, K R Prajwal, Sindhu B Hegde.

Figure 1: Semantic vs. non-semantic co-speech gestures. (Top) Obama performs an iconic semantic gesture for the word “massive”. Notice the significant temporal offset between the physical gesture and the spoken word. (Bottom) There are no semantic gestures around the word “beginning”. This paper focuses on building a dataset and models that can automatically recognize and localize semantic gestures in real-world cli…
Figure 2: Samples from the GRW dataset. (a) Lexical and kinematic diversity of semantic gestures. (b) Even when gesturing the same conceptual word (“SPIRAL”), speakers employ different physical motions. The dataset captures these gestures across a wide variety of speakers, poses, and camera angles. The GRW dataset comprises 156,688 manually annotated video clips curated from diverse, in-the-wild speaker environment…
Figure 3: Semantic Taxonomy of the GRW Dataset. We organize our 150-word vocabulary into a three-tiered hierarchy, radiating from five high-level semantic domains to specific target words (leaf nodes). Node size is strictly proportional to the frequency of annotated clips per word.
Figure 4: Figure 4 shows the proportion of gestured instances among all clips submitted for annotation. The likelihood that a word is accompanied by a semantically meaningful gesture varies dramatically across the lexicon. Some words are gestured very frequently (for example, ‘bye’ is gestured approximately 66% of the time), while others are rarely depicted. At the extreme low end, ‘look’ is gestured in less than 1% of cases. This …
Figure 5: Gesture-speech temporal alignment. Semantic gestures rarely align perfectly with spoken words. As shown in (a), gestures are significantly longer in duration than speech. (b) The vast majority of gestures start before (96.6%) and end after (89.7%) the target word is spoken. (c) Activation heatmaps (aggregated across all samples for a specific word) visually confirm this “envelope” effect.
Figure 6: (a) A binary semantic gesture classifier for a query clip. The model uses cross-attention between a target query clip and a broader temporal motion context. (b) The word recognition and localization model is trained in two stages: first, pre-trained on auto-generated pseudo-labels with weak boundary supervision, and then fine-tuned on manually annotated clean data for precise temporal localization.
Figure 7: We visualize sample predictions for joint word recognition and localization. In both instances, our model correctly assigns the target word (together, bye) and closely matches the ground truth temporal bounds. Notably, the right example illustrates our model’s ability to accurately localize anticipatory gestures that occur well before the corresponding speech is uttered.
Figure 8: Data samples from the GRW dataset. The dataset encompasses a diverse vocabulary of 150 conceptual words, ranging from numbers (‘two’) and spatial descriptors (‘little’, ‘back’) to physical actions (‘grasp’, ‘stack’) and conversational anchors (‘hello’). For each positive instance, we provide precise, frame-accurate temporal boundaries marking the start and end of the semantic gesture (highlighted by the …
Figure 9: Stage-1: Semantic Gesture Annotation Interface.
Figure 10: Stage-2: Temporal Boundary Annotation Interface.
Figure 11: Analysis of typical failure cases. The word recognition model occasionally misclassifies gestures into semantically and visually synonymous categories. (Top) The model predicts “raise” instead of the ground-truth “peak” as the speaker moves his hand upward. (Bottom) A classic finger-pinching motion results in a prediction of “tiny” rather than the ground-truth “little”. These examples highlight the inhere…
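
The alignment statistics quoted in the Figure 5 caption (the share of gestures that start before and end after the target word) could be computed from paired gesture and word intervals. The sketch below assumes hypothetical per-clip `(start, end)` times in seconds; `envelope_stats` is an illustrative name, and the toy intervals are made up and are not GRW data.

```python
from typing import Dict, Sequence, Tuple

Interval = Tuple[float, float]  # (start_sec, end_sec), with start < end


def envelope_stats(gestures: Sequence[Interval],
                   words: Sequence[Interval]) -> Dict[str, float]:
    """Summarize how gesture spans sit relative to their spoken words."""
    if len(gestures) != len(words) or not gestures:
        raise ValueError("need equally sized, non-empty interval lists")
    n = len(gestures)
    pairs = list(zip(gestures, words))
    return {
        # Fraction of gestures that begin before the word onset.
        "starts_before": sum(g[0] < w[0] for g, w in pairs) / n,
        # Fraction of gestures that finish after the word offset.
        "ends_after": sum(g[1] > w[1] for g, w in pairs) / n,
        "mean_gesture_sec": sum(g[1] - g[0] for g in gestures) / n,
        "mean_word_sec": sum(w[1] - w[0] for w in words) / n,
    }


# Made-up intervals for illustration only (not GRW data).
toy_gestures = [(0.5, 2.0), (1.0, 1.8), (0.2, 1.0)]
toy_words = [(1.0, 1.5), (0.9, 1.6), (0.4, 0.9)]
stats = envelope_stats(toy_gestures, toy_words)
```

A gesture that both starts before and ends after its word fully encloses it, which is the “envelope” pattern the caption describes.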
read the original abstract

While humans naturally gesture during speech, only a sparse subset of these movements are visually depictive and semantically linked to specific spoken words. Current multimodal models struggle to capture these semantic co-speech gestures, heavily bottlenecked by a lack of precisely annotated training data. To address this, we introduce the Gesture Recognition in the Wild (GRW) dataset, the first large-scale benchmark designed to map unconstrained human gestures to specific words with frame-accurate temporal boundaries. Comprising 156,688 manually annotated video clips, GRW spans a highly diverse 150-word taxonomy of physical actions, spatial descriptors, and abstract concepts. We leverage GRW to train video models to (a) classify gestures as semantic or not, (b) recognize the word corresponding to a co-speech gesture, and (c) temporally localize the gesture. We also use GRW to establish benchmarks for these three tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript claims that the GRW dataset is the first large-scale benchmark designed to map unconstrained human gestures to specific words with frame-accurate temporal boundaries. It comprises 156,688 manually annotated video clips spanning a highly diverse 150-word taxonomy of physical actions, spatial descriptors, and abstract concepts, and is used to define and benchmark three tasks: classifying gestures as semantic or not, recognizing the corresponding word, and temporally localizing the gesture.

Significance. If the annotations are shown to be accurate and consistent, the GRW dataset would address a clear data gap for semantic co-speech gesture understanding in multimodal models, providing scale and diversity that could support reproducible benchmarks across the three tasks.

major comments (2)
  1. [Abstract] The central claim that GRW supplies 'frame-accurate temporal boundaries' for the three tasks rests on the quality of the manual annotations, yet the abstract supplies no annotation protocol, number of annotators per clip, adjudication process, or quantitative agreement statistics (e.g., temporal IoU or label kappa).
  2. [Abstract] Without reported inter-annotator agreement or boundary-precision metrics, the weakest assumption (that the 156,688 clips are sufficiently accurate and representative) remains unverified, which leaves the benchmark claims for semantic classification, word recognition, and localization impossible to evaluate.
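
The two agreement statistics named in the first comment can be made concrete. The following is a minimal sketch, assuming two annotators label the same clips: `temporal_iou` compares inclusive frame intervals and `cohens_kappa` compares categorical labels. Both function names and the inclusive-frame convention are assumptions for illustration, not the paper's reported protocol.

```python
from collections import Counter
from typing import Sequence, Tuple


def temporal_iou(a: Tuple[int, int], b: Tuple[int, int]) -> float:
    """Intersection-over-union of two inclusive frame intervals (start, end)."""
    if a[1] < a[0] or b[1] < b[0]:
        raise ValueError("interval end precedes start")
    inter = max(min(a[1], b[1]) - max(a[0], b[0]) + 1, 0)
    union = (a[1] - a[0] + 1) + (b[1] - b[0] + 1) - inter
    return inter / union


def cohens_kappa(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Chance-corrected agreement between two annotators on the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need equally sized, non-empty label lists")
    n = len(labels_a)
    observed = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected if both annotators labeled independently
    # according to their own label frequencies.
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)
```

Averaging `temporal_iou` over doubly annotated clips would speak to boundary precision, while `cohens_kappa` on the semantic/non-semantic decision would speak to label consistency.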

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed comments on the abstract. We agree that the abstract should more explicitly support the annotation quality underlying the 'frame-accurate' claim and will revise it accordingly.

read point-by-point responses
  1. Referee: [Abstract] The central claim that GRW supplies 'frame-accurate temporal boundaries' for the three tasks rests on the quality of the manual annotations, yet the abstract supplies no annotation protocol, number of annotators per clip, adjudication process, or quantitative agreement statistics (e.g., temporal IoU or label kappa).

    Authors: We agree that the abstract would be stronger if it briefly referenced the annotation protocol. The full manuscript (Section 4) describes the multi-annotator process and adjudication. In revision we will add one concise sentence to the abstract summarizing the protocol, annotator count, and agreement statistics. revision: yes

  2. Referee: [Abstract] Without reported inter-annotator agreement or boundary-precision metrics, the weakest assumption (that the 156,688 clips are sufficiently accurate and representative) remains unverified, which leaves the benchmark claims for semantic classification, word recognition, and localization impossible to evaluate.

    Authors: We acknowledge the point: the current abstract does not report these metrics, making the claims harder to evaluate at a glance. The manuscript body provides annotation details; we will revise the abstract to include a short statement on inter-annotator agreement and boundary precision so the benchmark claims are more directly verifiable. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical dataset paper with no derivations or fitted predictions.

full rationale

The paper introduces the GRW dataset of 156,688 manually annotated clips and defines three downstream tasks (semantic classification, word recognition, temporal localization). No equations, parameters, or closed-form claims appear anywhere in the provided text. The contribution is a new benchmark definition rather than a derivation that reduces to its own inputs. No self-citations, ansatzes, or uniqueness theorems are invoked to support any mathematical result. Annotation quality concerns (protocol, agreement metrics) affect verifiability but do not constitute circularity under the defined patterns, as there is no reduction of a claimed prediction to a fitted input or self-referential definition.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The contribution rests on the unverified premise that manual annotation yields reliable frame-accurate labels; no free parameters, invented entities, or additional axioms are stated in the abstract.

axioms (1)
  • domain assumption: Manual annotations provide frame-accurate temporal boundaries for gestures.
    The utility of GRW for training and benchmarking the three tasks depends directly on this premise.

pith-pipeline@v0.9.1-grok · 5681 in / 1169 out tokens · 25830 ms · 2026-06-28T23:12:50.942201+00:00 · methodology

