pith. machine review for the scientific record.

arxiv: 2604.07606 · v1 · submitted 2026-04-08 · 💻 cs.CV

Recognition: no theorem link

Bootstrapping Sign Language Annotations with Sign Language Models


Pith reviewed 2026-05-10 17:46 UTC · model grok-4.3

classification 💻 cs.CV
keywords sign language annotation · pseudo-labeling · ASL · fingerspelling · isolated sign recognition · glosses · classifiers · LLM ranking

The pith

A pseudo-annotation pipeline uses fingerspelling and isolated sign recognizers plus LLM ranking to label large signed video datasets.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The authors aim to address the shortage of annotated sign language data by creating an automated pipeline. This pipeline processes signed videos along with English text to output ranked annotations that include time intervals for glosses, fingerspelled words, and sign classifiers. It builds on baseline models for fingerspelling recognition and isolated sign recognition that achieve strong results on public benchmarks. The method is validated by releasing both new human annotations on hundreds of videos and hundreds of hours of generated pseudo-annotations. If successful, this would allow fuller use of existing large but under-annotated datasets for developing AI sign language tools.

Core claim

The paper establishes a pipeline that takes signed video and English as input and outputs a ranked set of likely annotations, including time intervals, for glosses, fingerspelled words, and sign classifiers, using sparse predictions from fingerspelling and isolated sign recognizers along with a K-shot LLM approach.

What carries the argument

The pseudo-annotation pipeline that fuses sparse outputs from a fingerspelling recognizer and an isolated sign recognizer with K-shot large language model ranking.
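
To make that fusion concrete, here is a minimal sketch of the shape such a pipeline could take; every name in it (SpottedSign, build_kshot_prompt, the evidence format) is an illustrative assumption, not the paper's actual interface.

```python
# Illustrative sketch only: assumed data shapes, not the paper's code.
from dataclasses import dataclass

@dataclass
class SpottedSign:
    gloss: str    # e.g. "HUNDRED", or a fingerspelled word like "JOHN"
    start: float  # seconds into the video
    end: float
    score: float  # recognizer confidence
    source: str   # "fingerspelling" or "isr"

def build_kshot_prompt(english: str, spottings: list[SpottedSign],
                       examples: list[tuple[str, str]]) -> str:
    """Assemble a K-shot prompt: K worked (English, gloss) pairs, then the
    target English sentence plus the sparse spottings as timing evidence."""
    shots = "\n".join(f"English: {e}\nGloss: {g}" for e, g in examples)
    evidence = ", ".join(
        f"{s.gloss}[{s.start:.1f}-{s.end:.1f}s p={s.score:.2f}]"
        for s in sorted(spottings, key=lambda s: s.start))
    return (f"{shots}\n\nEnglish: {english}\nDetected signs: {evidence}\n"
            "Rank likely gloss annotations with time intervals as JSON.")

def rerank(candidates: list[str], spottings: list[SpottedSign]) -> list[str]:
    """Stand-in for the LLM ranking step: prefer candidate gloss sequences
    that agree with more high-confidence recognizer spottings."""
    def agreement(cand: str) -> float:
        toks = set(cand.split())
        return sum(s.score for s in spottings if s.gloss in toks)
    return sorted(candidates, key=agreement, reverse=True)
```

The point of the sketch is the division of labor: the recognizers contribute sparse, time-stamped anchors, and the LLM supplies the dense gloss hypotheses that those anchors are scored against.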

If this is right

  • The pipeline allows generation of over 300 hours of pseudo-annotations for datasets like ASL STEM Wiki.
  • Baseline models reach state-of-the-art results: 6.7% CER on FSBoard for fingerspelling and 74% top-1 accuracy on ASL Citizen for isolated signs (CER is defined in the sketch after this list).
  • New human annotations on nearly 500 videos provide a benchmark for the pseudo-annotations.
  • Releasing both human and pseudo labels supports further research on sign language annotation.
  • Improved annotation would enable better training of sign language interpretation models.
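
For readers unfamiliar with the fingerspelling metric: character error rate (CER) is Levenshtein edit distance normalized by reference length. A minimal sketch of the standard definition (not the authors' evaluation code):

```python
def cer(ref: str, hyp: str) -> float:
    """Character error rate: Levenshtein distance between the reference
    and hypothesis strings, normalized by reference length."""
    m, n = len(ref), len(hyp)
    # prev[j] holds the edit distance between the current ref prefix and hyp[:j]
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[n] / max(m, 1)

# e.g. cer("HELLO", "HELO") == 0.2, i.e. 20% CER; a 6.7% CER means roughly
# one character error per fifteen reference characters.
```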

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the method generalizes, it could reduce annotation costs for sign language datasets in other languages by using similar base models.
  • The approach might be extended to continuous sign language recognition tasks by incorporating more temporal modeling.
  • Releasing the annotations creates an opportunity to test how well LLM ranking performs on other sequence annotation problems.

Load-bearing premise

Sparse predictions from the recognizers combined with K-shot LLM ranking will yield annotations accurate enough to be useful without extensive human correction.

What would settle it

If the pseudo-annotations show low overlap or accuracy when compared directly to the professional human annotations on the nearly 500 videos, the utility of the pipeline would be called into question.
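
That comparison could be run with something as simple as temporal IoU plus exact gloss match; the (gloss, start, end) tuple format and the 0.5 threshold below are assumptions of ours, since the paper does not specify an evaluation protocol.

```python
# Hypothetical check of pseudo-annotations against human gold labels.
def interval_iou(a: tuple[float, float], b: tuple[float, float]) -> float:
    """Intersection-over-union of two time intervals in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def match_rate(pseudo, gold, iou_thresh=0.5):
    """Fraction of gold annotations matched by a pseudo-annotation with
    the same gloss and temporal IoU above the threshold."""
    matched, used = 0, set()
    for g_gloss, g_start, g_end in gold:
        for i, (p_gloss, p_start, p_end) in enumerate(pseudo):
            if i in used or p_gloss != g_gloss:
                continue
            if interval_iou((g_start, g_end), (p_start, p_end)) >= iou_thresh:
                matched += 1
                used.add(i)
                break
    return matched / len(gold) if gold else 0.0

gold = [("HAPPY", 1.0, 1.8), ("SCHOOL", 2.1, 2.9)]
pseudo = [("HAPPY", 1.1, 1.9), ("HOME", 2.0, 3.0)]
print(match_rate(pseudo, gold))  # 0.5: one of two gold glosses recovered
```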

Figures

Figures reproduced from arXiv: 2604.07606 by Colin Lea, Connor Gillis, Leah Findlater, Lorna Quandt, Raja Kushalnagar, Vasileios Baltatzis.

Figure 1. Our pseudo-annotation pipeline takes in English text and ASL video and outputs likely gloss annotations. There are three steps: …
Figure 2. Architecture and design of the (left) fingerspelling …
Figure 3. Candidate forced alignment plots using our pseudo-annotation pipeline. Gray areas are …
Original abstract

AI-driven sign language interpretation is limited by a lack of high-quality annotated data. New datasets including ASL STEM Wiki and FLEURS-ASL contain professional interpreters and 100s of hours of data but remain only partially annotated and thus underutilized, in part due to the prohibitive costs of annotating at this scale. In this work, we develop a pseudo-annotation pipeline that takes signed video and English as input and outputs a ranked set of likely annotations, including time intervals, for glosses, fingerspelled words, and sign classifiers. Our pipeline uses sparse predictions from our fingerspelling recognizer and isolated sign recognizer (ISR), along with a K-Shot LLM approach, to estimate these annotations. In service of this pipeline, we establish simple yet effective baseline fingerspelling and ISR models, achieving state-of-the-art on FSBoard (6.7% CER) and on ASL Citizen datasets (74% top-1 accuracy). To validate and provide a gold-standard benchmark, a professional interpreter annotated nearly 500 videos from ASL STEM Wiki with sequence-level gloss labels containing glosses, classifiers, and fingerspelling signs. These human annotations and over 300 hours of pseudo-annotations are being released in supplemental material.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper introduces a pseudo-annotation pipeline for sign language videos that combines sparse outputs from a fingerspelling recognizer and isolated sign recognizer (ISR) with a K-shot LLM ranking step to generate ranked annotations, including time intervals, for glosses, fingerspelled words, and sign classifiers. It reports state-of-the-art results on the FSBoard dataset (6.7% CER) and ASL Citizen dataset (74% top-1 accuracy), releases a new 500-video human-annotated sequence-level gloss benchmark from ASL STEM Wiki, and provides over 300 hours of pseudo-annotations.

Significance. If the pipeline's outputs prove sufficiently accurate to meaningfully reduce human annotation effort, the work would provide a scalable approach to annotating large sign language corpora such as ASL STEM Wiki and FLEURS-ASL, directly addressing a primary bottleneck in training AI systems for sign language interpretation. The release of both the human gold-standard benchmark and the pseudo-annotations constitutes a concrete, reusable resource for the community.

major comments (1)
  1. [Abstract] The central claim that the pipeline 'outputs a ranked set of likely annotations' usable for bootstrapping is unsupported by any quantitative evaluation. No metrics (gloss matching rate, temporal overlap, ranking quality, or human effort reduction) are reported comparing pipeline outputs to the 500-video human benchmark that the authors themselves created and release.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback. We address the major comment below.

Point-by-point responses
  1. Referee: [Abstract] The central claim that the pipeline 'outputs a ranked set of likely annotations' usable for bootstrapping is unsupported by any quantitative evaluation. No metrics (gloss matching rate, temporal overlap, ranking quality, or human effort reduction) are reported comparing pipeline outputs to the 500-video human benchmark that the authors themselves created and release.

    Authors: We acknowledge that the manuscript does not include a direct quantitative comparison of the full pipeline outputs against the new 500-video human benchmark. The current version validates the pipeline components via state-of-the-art results on FSBoard and ASL Citizen and releases both the human annotations and the pseudo-annotations to support community evaluation. We agree that adding explicit metrics would better substantiate the bootstrapping claim. In the revised manuscript we will include an evaluation on the 500-video set reporting gloss matching rate, temporal overlap for time intervals, ranking quality, and a discussion of human effort reduction, and we will update the abstract to reference these results. revision: yes
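
The rebuttal does not name a ranking-quality metric; if it were something like mean reciprocal rank over the candidate lists, the computation would look like this (our assumption, shown only to pin down what 'ranking quality' could mean here).

```python
def mean_reciprocal_rank(ranked_lists, correct):
    """MRR over videos: for each video, the reciprocal rank of the first
    candidate annotation that matches the human gold label (0 if none)."""
    total = 0.0
    for candidates, gold in zip(ranked_lists, correct):
        rr = 0.0
        for rank, cand in enumerate(candidates, start=1):
            if cand == gold:
                rr = 1.0 / rank
                break
        total += rr
    return total / len(ranked_lists) if ranked_lists else 0.0

# Two videos: gold found at rank 1 and rank 3 -> (1 + 1/3) / 2 ~= 0.667
print(mean_reciprocal_rank([["A", "B"], ["X", "Y", "Z"]], ["A", "Z"]))
```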

Circularity Check

0 steps flagged

No significant circularity in the pseudo-annotation pipeline derivation

Full rationale

The paper constructs its pseudo-annotation pipeline from independently trained fingerspelling and isolated sign recognizers (reporting SOTA on separate external benchmarks FSBoard and ASL Citizen) plus a K-shot LLM ranking step. These inputs are not defined in terms of the pipeline outputs, and no prediction reduces to a fitted parameter or self-definition by construction. The new 500-video human-annotated gold standard is created separately by a professional interpreter and is not used to derive or fit the pipeline itself. No load-bearing self-citations, uniqueness theorems, or ansatz smuggling appear in the derivation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; the approach relies on standard supervised learning for the recognizers and prompting for the LLM.

pith-pipeline@v0.9.0 · 5533 in / 1185 out tokens · 46919 ms · 2026-05-10T17:46:53.664212+00:00 · methodology


Reference graph

Works this paper leans on

57 extracted references · 2 canonical work pages · 1 internal anchor

  1. [1] Mohamed Al-Badrashiny, Marco Tanti, Neil C. Fox, Necati Cihan Camgoz, and Richard Bowden. BOBSL: BBC-Oxford British Sign Language Dataset. In European Conference on Computer Vision (ECCV), pages 40–58, 2022.

  2. [2] Steven Bird and Edward Loper. NLTK: The Natural Language Toolkit. In Proceedings of the ACL Interactive Poster and Demonstration Sessions, pages 214–217, Barcelona, Spain, July 2004. Association for Computational Linguistics.

  3. [3] Danielle Bragg, Oscar Koller, Mary Bellard, Larwan Berke, Patrick Boudreault, Annelies Braffort, Naomi Caselli, Matt Huenerfauth, Hernisa Kacorri, Tessa Verhoef, Christian Vogler, and Meredith Ringel Morris. Sign Language Recognition, Generation, and Translation: An Interdisciplinary Perspective. In The 21st International ACM SIGACCESS Conference on Computers and Accessibility, 2019.

  4. [4] Necati Cihan Camgoz, Simon Hadfield, Oscar Koller, Hermann Ney, and Richard Bowden. Neural sign language translation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 7784–7793, 2018.

  5. [5] Joao Carreira and Andrew Zisserman. Quo vadis, action recognition? A new model and the Kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6299–6308, 2017.

  6. [6] Naomi K Caselli, Zed Sevcikova Sehyr, Ariel M Cohen-Goldberg, and Karen Emmorey. ASL-LEX: A lexical database of American Sign Language. Behavior Research Methods, 49(2):784–801, 2017.

  7. [7] Zhigang Chen, Benjia Zhou, Jun Li, Jun Wan, Zhen Lei, Ning Jiang, Quan Lu, and Guoqing Zhao. Factorized learning assisted with large language model for gloss-free sign language translation. In Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, and Nianwen Xue, editors, Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024).

  8. [8] Ashley Chow, Manfred Georg, Mark Sherwood, Phil Culliton, Sam Sepah, Sohier Dane, Thad Starner, and Glenn Cameron. Google – American Sign Language Fingerspelling Recognition, 2023. Kaggle competition.

  9. [9] Alexis Conneau, Min Ma, Simran Khanuja, Yu Zhang, Vera Axelrod, Siddharth Dalmia, Jason Riesa, Clara Rivera, and Ankur Bapna. FLEURS: Few-shot learning evaluation of universal representations of speech. In 2022 IEEE Spoken Language Technology Workshop (SLT), pages 798–805. IEEE.

  10. [10] Aashaka Desai, Lauren Berger, Fyodor Minakov, Nessa Milano, Chinmay Singh, Kriston Pumphrey, Richard Ladner, Hal Daumé III, Alex X Lu, Naomi Caselli, et al. ASL Citizen: A community-sourced dataset for advancing isolated sign language recognition. Advances in Neural Information Processing Systems, 36:76893–76907, 2023.

  11. [11] Philippe Dreuw and Hermann Ney. Towards automatic sign language annotation for the ELAN tool. In Workshop Programme, volume 50.

  12. [12] Amanda Duarte, Shruti Palaskar, Lucas Ventura, Francisco De la Torre, Florian Metze, and Jimmy Saragih. How2Sign: A large-scale multimodal dataset for continuous American Sign Language. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 2735–2744, 2021.

  13. [13] Haoqi Fan, Bo Li, Kaiming He, Ross Girshick, and Piotr Dollár. Multiscale vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 5824–5835, 2021.

  14. [14] Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, and Kaiming He. SlowFast networks for video recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 6202–6211, 2019.

  15. [15] Manfred Georg, Garrett Tanzer, Esha Uboweja, Saad Hassan, Maximus Shengelia, Sam Sepah, Sean Forbes, and Thad Starner. FSboard: Over 3 million characters of ASL fingerspelling collected via smartphones. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 13897–13906, June 2025.

  16. [16] Abraham Glasser, Vaishnavi Mande, and Matt Huenerfauth. Understanding deaf and hard-of-hearing users' interest in sign-language interaction with personal-assistant devices. In Proceedings of the 18th International Web for All Conference, pages 1–11, 2021.

  17. [17] Kyle Gorman, Jonathan Howell, and Michael Wagner. Prosodylab-Aligner: A tool for forced alignment of laboratory speech. Canadian Acoustics, 39(3):192–193, 2011.

  18. [18] Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber. Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd International Conference on Machine Learning, pages 369–376, 2006.

  19. [19] Shester Gueuwou, Xiaodan Du, Greg Shakhnarovich, Karen Livescu, and Alexander H Liu. SHuBERT: Self-supervised sign language representation learning via multi-stream cluster prediction. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025.

  20. [20] Thomas Hanke. HamNoSys: Representing sign language data in language resources and language processing contexts. In sign-lang@LREC 2004, pages 1–6. European Language Resources Association (ELRA), 2004.

  21. [21] Allison I Hilger, Torrey MJ Loucks, David Quinto-Pozos, and Matthew WG Dye. Second language acquisition across modalities: Production variability in adult L2 learners of American Sign Language. Second Language Research, 31(3):375–388, 2015.

  22. [22] Vivek Iyer, Bhavitvya Malik, Wenhao Zhu, Pavel Stepachev, Pinzhen Chen, Barry Haddow, and Alexandra Birch. Exploring very low-resource translation with LLMs: The University of Edinburgh's submission to AmericasNLP 2024 translation task. In Manuel Mager, Abteen Ebrahimi, Shruti Rijhwani, Arturo Oncevay, Luis Chiruzzo, Robert Pugh, and Katharina von…

  23. [23] Youngjoon Jang, Haran Raajesh, Liliane Momeni, Gül Varol, and Andrew Zisserman. Lost in translation, found in context: Sign language translation with contextual cues. In 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 8742–8752, 2025.

  24. [24] Aiwei Jiang, Yutong Wang, Bowen Zheng, and Jing Wang. SignLLM: A framework for sign language translation with large language models. arXiv preprint arXiv:2402.13011, 2024.

  25. [25] Jonathan Keane and Diane Brentari. Fingerspelling: Beyond handshape sequences. The Oxford Handbook of Deaf Studies in Language, pages 146–160, 2016.

  26. [26] Dongxu Li, Cristian Rodriguez, Xin Yu, and Hongdong Li. Word-level deep sign language recognition from video: A new large-scale dataset and methods. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 1458–1467, 2020.

  27. [27] Peike Li, Trevor Cohn, and Jianyuan Guo. Bridging sign and spoken languages: Pseudo gloss generation for sign language translation. In Advances in Neural Information Processing Systems (NeurIPS), 2025.

  28. [28] Camillo Lugaresi, Jiuqiang Tang, Hadon Nash, Chris McClanahan, Esha Uboweja, Michael Hays, Fan Zhang, Chuo-Ling Chang, Ming Guang Yong, Juhyun Lee, et al. MediaPipe: A framework for building perception pipelines. arXiv preprint arXiv:1906.08172, 2019.

  29. [29] Ozge Mercanoglu Sincan and Richard Bowden. Spotter+GPT: Turning sign spottings into sentences with LLMs. In Adjunct Proceedings of the 25th ACM International Conference on Intelligent Virtual Agents, IVA Adjunct '25, pages 1–6. ACM, September 2025.

  30. [30] Carol Padden and Darline Clark Gunsauls. How the alphabet came to be used in a sign language. Sign Language Studies, 4(1):10–33, 2003.

  31. [31] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: A method for automatic evaluation of machine translation. In Pierre Isabelle, Eugene Charniak, and Dekang Lin, editors, Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA, July 2002. Association for Computational Linguistics.

  32. [32] Deborah Chen Pichler. Sources of handshape error in first-time signers of ASL. Deaf Around the World: The Impact of Language, pages 96–121, 2011.

  33. [33] Maja Popović. chrF: Character n-gram F-score for automatic MT evaluation. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 392–395, 2015.

  34. [34] Ricardo Rei, Craig Stewart, Ana C Farinha, and Alon Lavie. COMET: A neural framework for MT evaluation. In Bonnie Webber, Trevor Cohn, Yulan He, and Yang Liu, editors, Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2685–2702, Online, November 2020. Association for Computational Linguistics.

  35. [35] Krister Schönström and Ingela Holmström. L2M1 and L2M2 acquisition of sign lexicon: The impact of multimodality on the sign second language acquisition. Frontiers in Psychology, 13:896254, 2022.

  36. [36] Thibault Sellam, Dipanjan Das, and Ankur Parikh. BLEURT: Learning robust metrics for text generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7881–7892, 2020.

  37. [37] Bowen Shi, Diane Brentari, Greg Shakhnarovich, and Karen Livescu. Open-domain sign language translation learned from online video. In EMNLP, 2022.

  38. [38] Ozge Mercanoglu Sincan, Jian He Low, Sobhan Asasi, and Richard Bowden. Gloss-free sign language translation: An unbiased evaluation of progress in the field. Computer Vision and Image Understanding, 261:104498, 2025.

  39. [39] Valerie Sutton. The SignWriting Alphabet. Read and Write Any Sign Language in the World. ISWA Manual, 2010.

  40. [40] Garrett Tanzer. Fingerspelling within sign language translation, 2024.

  41. [41] Garrett Tanzer. FLEURS-ASL: Including American Sign Language in massively multilingual multitask evaluation. In Luis Chiruzzo, Alan Ritter, and Lu Wang, editors, Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 6167–6191, 2025.

  42. [42] Garrett Tanzer, Maximus Shengelia, Ken Harrenstien, and David Uthus. Reconsidering sentence-level sign language translation, 2024.

  43. [43] Garrett Tanzer and Biao Zhang. YouTube-SL-25: A large-scale, open-domain multilingual sign language parallel corpus. In The Thirteenth International Conference on Learning Representations, 2025.

  44. [44] Nina Tran, Richard E. Ladner, and Danielle Bragg. U.S. deaf community perspectives on automatic sign language translation. In Proceedings of the 25th International ACM SIGACCESS Conference on Computers and Accessibility, ASSETS '23, New York, NY, USA, 2023. Association for Computing Machinery.

  45. [45] Clayton Valli and Ceil Lucas. Linguistics of American Sign Language: An Introduction. Gallaudet University Press.

  46. [46] Gül Varol, Ivan Laptev, Cordelia Schmid, and Andrew Zisserman. Gloss-free sign language recognition with question-answering. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 30–39, 2021.

  47. [47] Ryan Wong, Necati Cihan Camgoz, and Richard Bowden. Sign2GPT: Leveraging large language models for gloss-free sign language translation.

  48. [48] Kayo Yin, Chinmay Singh, Fyodor O Minakov, Vanessa Milan, Hal Daumé III, Cyril Zhang, Alex Xijie Lu, and Danielle Bragg. ASL STEM Wiki: Dataset and benchmark for interpreting STEM articles. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors, Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 14…

  49. [49] Han Zhang, Rotem Shalev-Arkushin, Vasileios Baltatzis, Connor Gillis, Gierad Laput, Raja Kushalnagar, Lorna C Quandt, Leah Findlater, Abdelkareem Bedri, and Colin Lea. Towards AI-driven sign language generation with non-manual markers. In Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems, pages 1–26, 2025.

  50. [50] Yutong Zheng, Yutong Zuo, Fangyun Wei, Wei-Shi Wang, and Xiao Wang. SignCLIP: Contrastive language-image pre-training for zero-shot sign language recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 16006–16016, 2023.

  51. [51] Benjia Zhou, Zhigang Chen, Albert Clapés, Jun Wan, Yanyan Liang, Sergio Escalera, Zhen Lei, and Du Zhang. Gloss-free sign language translation: Improving from visual-language pretraining. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 20871–20881, October 2023.

  52. [52] Hao Zhou, Wengang Zhou, Weizhen Qi, Junfu Pu, and Houqiang Li. Improving sign language translation with monolingual data by sign back-translation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 1316–1325, 2021.

  53. [53] Yutong Zuo, Fangyun Wei, Zeng-Fei Wang, Wei-Shi Wang, and Xiao Wang. CiCo: Visual-textual contrastive learning for sign language recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 15951–15962, 2023.

  Entries 54–57 are not references: the extractor captured fragments of the paper's Appendix A ("LLM Prompts"), which shares three prompts, including a K-shot English-to-ASL-gloss prompt with conventions for numbers (digits for 1–9, conceptual signing for 10+, year, age, and time markers), grammar markers (hyphenated compound signs, fingerspelling without "FS-" prefixes or inter-letter dashes, classifiers such as CL:1(POINT)), varied ASL sentence structure (often Time-Subject-Object-Verb), and common conventions (FINISH or time markers for past tense, question markers and facial expressions, NOT for negation), followed by a worked example that returns ranked JSON candidates for "I am happy".