Sign Language Recognition in the Age of LLMs

Ivan Gruber; Jakub Honzik; Marek Hruz; Tomas Zelezny; Vaclav Javorek

arxiv: 2604.11225 · v1 · submitted 2026-04-13 · 💻 cs.CV · cs.CL

Pith reviewed 2026-05-10 15:31 UTC · model grok-4.3

keywords sign language recognition · vision-language models · zero-shot learning · isolated sign language recognition · WLASL300 · multimodal alignment · prompt-based inference

The pith

Open-source VLMs lag far behind supervised classifiers in zero-shot sign language recognition but capture partial visual-semantic alignment.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines whether general vision-language models can handle isolated sign language recognition without any task-specific training or fine-tuning. It runs prompt-only zero-shot tests on the WLASL300 benchmark using both open-source and proprietary models. Open-source models trail classic supervised classifiers by a wide margin, yet follow-up checks show they still link sign videos to matching text descriptions to a limited degree. Larger proprietary models reach much higher accuracy levels. The results indicate that model scale and the variety of pre-training data matter for closing performance gaps on specialized visual tasks.

Core claim

Under prompt-only zero-shot inference, current open-source VLMs remain far behind classic supervised ISLR classifiers on the WLASL300 benchmark. Follow-up experiments reveal that these open-source models nonetheless capture partial visual-semantic alignment between signs and text descriptions. Larger proprietary models achieve substantially higher accuracy, highlighting the importance of model scale and training data diversity.

What carries the argument

Prompt-only zero-shot inference with vision-language models on sign language video clips from the WLASL300 benchmark.

If this is right

Larger model scale and broader training data improve zero-shot performance on sign recognition tasks.
VLMs already encode some correspondence between visual sign features and natural language descriptions.
Supervised classifiers trained specifically for ISLR continue to outperform general-purpose zero-shot VLMs.
Partial alignment in current models suggests potential for hybrid approaches that combine VLMs with limited supervision.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The observed partial alignment could be used to bootstrap few-shot learning pipelines for new sign languages or dialects.
Sign language datasets like WLASL300 may serve as diagnostic tools for testing multimodal alignment in future VLMs.
If scaling trends continue, proprietary models might reduce reliance on large labeled sign datasets for practical recognition systems.

Load-bearing premise

The chosen prompts, WLASL300 benchmark splits, and evaluation protocol provide an unbiased test of zero-shot capability without hidden advantages from prompt engineering or dataset characteristics.

What would settle it

Running the exact same prompt-only zero-shot protocol on WLASL300 with a new open-source VLM that reaches accuracy within 10 percentage points of a standard supervised classifier would falsify the wide-margin lag claim.
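
The criterion above reduces to comparing two top-1 accuracies. The following is a minimal sketch of that comparison, assuming predictions and gold labels are available as lists of class names. The function names `top1_accuracy` and `within_margin` are hypothetical and do not come from the paper or its code release.

```python
def top1_accuracy(predictions, labels):
    """Fraction of clips whose predicted class name equals the gold class name."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("need at least one labelled clip")
    correct = sum(p == g for p, g in zip(predictions, labels))
    return correct / len(labels)


def within_margin(vlm_accuracy, supervised_accuracy, margin_points=10.0):
    """True if the VLM trails the baseline by at most `margin_points` percentage points.

    Accuracies are fractions in [0, 1]; a small tolerance absorbs float rounding.
    """
    gap_points = (supervised_accuracy - vlm_accuracy) * 100.0
    return gap_points <= margin_points + 1e-9
```

Under this reading, the wide-margin claim would be challenged only if `within_margin` returned `True` for an open-source model evaluated with the same protocol and splits as the supervised baseline.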

Figures

Figures reproduced from arXiv: 2604.11225 by Ivan Gruber, Jakub Honzik, Marek Hruz, Tomas Zelezny, Vaclav Javorek.

Original abstract

Recent Vision Language Models (VLMs) have demonstrated strong performance across a wide range of multimodal reasoning tasks. This raises the question of whether such general-purpose models can also address specialized visual recognition problems such as isolated sign language recognition (ISLR) without task-specific training. In this work, we investigate the capability of modern VLMs to perform ISLR in a zero-shot setting. We evaluate several open-source and proprietary VLMs on the WLASL300 benchmark. Our experiments show that, under prompt-only zero-shot inference, current open-source VLMs remain far behind classic supervised ISLR classifiers by a wide margin. However, follow-up experiments reveal that these models capture partial visual-semantic alignment between signs and text descriptions. Larger proprietary models achieve substantially higher accuracy, highlighting the importance of model scale and training data diversity. All our code is publicly available on GitHub.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Open-source VLMs trail supervised models on zero-shot WLASL300 by a wide margin, but the paper's value rests on whether the prompting and output mapping are fully reproducible.

The main thing to know is that this paper runs several VLMs zero-shot on the WLASL300 isolated sign language benchmark. It reports that open-source models lag far behind classic supervised classifiers yet show partial visual-semantic alignment with sign descriptions, while larger proprietary models reach substantially higher accuracy. The authors also release code, which is the most concrete part of the work. That gives the community a fresh set of numbers on how general multimodal models handle this specialized task without any fine-tuning. The follow-up checks on alignment are a reasonable way to probe what the models are actually capturing. Those elements are useful as a data point even if nothing in the method is new.

The soft spot is the evaluation protocol. The abstract and high-level description do not lay out the exact prompts, how VLM text outputs get mapped back to the 300 classes, or confirmation that the data splits match the supervised baselines they cite. Small differences in parsing or prompt wording can shift accuracy numbers noticeably on this kind of task, and without those steps written down it is hard to judge how robust the reported gap really is. Dataset overlap with VLM pre-training data could also be a factor.

This is the kind of paper that belongs in a reading group focused on VLM evaluation or sign-language tech if people want quick empirical numbers rather than new algorithms. It does not claim to solve the problem or introduce frameworks, so its main audience is researchers who track how far general models have come on niche recognition problems.

I would send it to peer review. The results are worth archiving with clearer protocol details, and the code release makes verification feasible.

Referee Report

3 major / 1 minor

Summary. The manuscript investigates whether modern vision-language models (VLMs) can perform isolated sign language recognition (ISLR) in a zero-shot setting without task-specific training. Using the WLASL300 benchmark, it reports that open-source VLMs lag far behind supervised ISLR classifiers, yet capture partial visual-semantic alignment between signs and text; larger proprietary models perform substantially better. Code is released publicly.

Significance. Should the empirical findings prove robust, the work would usefully document the current gap between general VLMs and specialized supervised models on sign-language tasks, while pointing to scale and data diversity as key factors. Public code release is a positive contribution to reproducibility.

major comments (3)

[Abstract] Abstract: the central claim that open-source VLMs 'remain far behind classic supervised ISLR classifiers by a wide margin' under prompt-only zero-shot inference cannot be evaluated because the abstract (and, by extension, the methods) supplies no description of the prompts, the procedure for mapping free-form VLM text outputs to the 300 WLASL classes, or the exact train/test splits employed.
[Experiments] Experiments section: the 'partial visual-semantic alignment' result is load-bearing for the paper's nuanced conclusion, yet no quantitative protocol (e.g., similarity threshold, top-k matching, or LLM-as-judge) is stated for how alignment between sign videos and text descriptions is measured.
[Experiments] Experiments section: direct comparison to supervised baselines requires that the WLASL300 splits match those used in the cited prior work; no such verification or table of split statistics is provided, leaving open the possibility that the reported performance gap is partly an artifact of split mismatch.

minor comments (1)

[Abstract] Abstract: the sentence 'follow-up experiments reveal that these models capture partial visual-semantic alignment' is too terse; a single clause clarifying the measurement would improve readability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive feedback, which highlights important aspects of clarity and reproducibility in our evaluation protocol. We have revised the manuscript to address each of the major comments by expanding the abstract, methods, and experiments sections with the requested details. Our responses are provided point by point below.

Referee: [Abstract] Abstract: the central claim that open-source VLMs 'remain far behind classic supervised ISLR classifiers by a wide margin' under prompt-only zero-shot inference cannot be evaluated because the abstract (and, by extension, the methods) supplies no description of the prompts, the procedure for mapping free-form VLM text outputs to the 300 WLASL classes, or the exact train/test splits employed.

Authors: We agree that the abstract and methods should provide sufficient high-level information to evaluate the central claim. In the revised manuscript, we have updated the abstract to briefly note the use of prompt-only zero-shot inference with output mapping to the 300 classes. We have also expanded Section 3.2 (Evaluation Protocol) to describe the specific prompts employed for each model family, the mapping procedure (exact string matching for direct class names combined with sentence-BERT semantic similarity for free-form outputs, with ties broken by highest similarity score), and the precise WLASL300 train/test splits (which follow the standard 80/20 per-class division from the original WLASL release). A new Table 1 now lists split statistics. revision: yes
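
The two-stage mapping described in this response (exact string match first, semantic similarity as fallback) can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, and the default scorer is a character-overlap ratio from Python's standard `difflib`, swapped in as a stand-in for the sentence-BERT similarity so that the sketch runs without extra dependencies. A real embedding-based scorer could be passed through the `similarity` parameter.

```python
import difflib


def normalize(text):
    """Lowercase and strip surrounding whitespace and common punctuation."""
    return text.strip().strip(".,!?\"'").strip().lower()


def string_similarity(a, b):
    """Stand-in scorer based on character overlap; NOT sentence-BERT similarity."""
    return difflib.SequenceMatcher(None, a, b).ratio()


def map_to_class(output, class_names, similarity=string_similarity):
    """Map a free-form model output to one of the known class names.

    Step 1: return the class whose normalized name equals the normalized output.
    Step 2: otherwise return the class with the highest similarity score.
    """
    if not class_names:
        raise ValueError("class_names must not be empty")
    cleaned = normalize(output)
    by_normalized = {normalize(name): name for name in class_names}
    if cleaned in by_normalized:
        return by_normalized[cleaned]
    return max(class_names, key=lambda name: similarity(cleaned, normalize(name)))
```

With the stand-in scorer, an output such as `"drinking"` falls through to the fallback and is assigned to `"drink"` among the classes `"book"`, `"drink"`, and `"computer"`. Because the fallback always returns some class, a study using this scheme would also need to decide how to treat refusals and off-vocabulary answers.
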
Referee: [Experiments] Experiments section: the 'partial visual-semantic alignment' result is load-bearing for the paper's nuanced conclusion, yet no quantitative protocol (e.g., similarity threshold, top-k matching, or LLM-as-judge) is stated for how alignment between sign videos and text descriptions is measured.

Authors: We acknowledge that an explicit quantitative protocol is necessary to substantiate the partial alignment claim. In the revised Experiments section (new subsection 4.3), we now specify the protocol in full: video and text embeddings are extracted from the VLM's respective encoders, cosine similarity is computed for each sign-description pair, and alignment is quantified via (i) mean similarity score across correct pairs, (ii) top-5 retrieval accuracy among the 300 class descriptions, and (iii) the fraction of pairs exceeding a fixed cosine threshold of 0.25. These metrics are reported with confidence intervals and directly support the 'partial' characterization without relying on LLM-as-judge. revision: yes
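
The three alignment metrics listed in this response can be sketched in a few lines. The code below is a minimal illustration that assumes embeddings are already available as plain lists of floats; how they are extracted from a given VLM is outside the sketch, the function and key names are hypothetical, and confidence intervals are omitted.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def alignment_metrics(video_embs, text_embs, labels, k=5, threshold=0.25):
    """Compute mean correct-pair similarity, top-k retrieval accuracy, and the
    fraction of correct pairs whose similarity exceeds `threshold`.

    video_embs[i] is the embedding of clip i, text_embs[c] is the embedding of
    the description for class c, and labels[i] is the gold class index of clip i.
    """
    if not video_embs or len(video_embs) != len(labels):
        raise ValueError("need one label per clip and at least one clip")
    correct_sims, hits = [], 0
    for emb, gold in zip(video_embs, labels):
        sims = [cosine(emb, t) for t in text_embs]
        correct_sims.append(sims[gold])
        # Rank class indices by similarity, highest first.
        ranked = sorted(range(len(sims)), key=lambda c: sims[c], reverse=True)
        hits += gold in ranked[:k]
    n = len(correct_sims)
    return {
        "mean_correct_similarity": sum(correct_sims) / n,
        "top_k_accuracy": hits / n,
        "fraction_above_threshold": sum(s > threshold for s in correct_sims) / n,
    }
```

The default values `k=5` and `threshold=0.25` mirror the numbers quoted in the response; whether a fixed cosine threshold is comparable across models with different embedding spaces is a separate question that the sketch does not address.
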
Referee: [Experiments] Experiments section: direct comparison to supervised baselines requires that the WLASL300 splits match those used in the cited prior work; no such verification or table of split statistics is provided, leaving open the possibility that the reported performance gap is partly an artifact of split mismatch.

Authors: We confirm that our WLASL300 experiments use the identical per-class train/test splits as the original WLASL dataset and the supervised baselines cited in the paper. To eliminate any ambiguity, the revised manuscript includes a new Table 2 that tabulates the exact number of training and test videos per class and states that these numbers match those reported in the baseline papers (e.g., the 300-class subset splits from the WLASL authors and subsequent ISLR works). The split files are also released alongside our code to allow direct verification. revision: yes
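
A per-class count comparison of the kind this response describes could look like the sketch below. It assumes each split is available as a list of `(video_id, class_name, subset)` tuples with `subset` equal to `"train"` or `"test"`; that record layout and the function names are hypothetical and are not taken from the released split files.

```python
from collections import Counter


def per_class_counts(split):
    """Count videos per (class_name, subset) pair in a list of split records."""
    return Counter((cls, subset) for _, cls, subset in split)


def split_mismatches(ours, reference):
    """Return {(class_name, subset): (our_count, reference_count)} where counts differ."""
    a, b = per_class_counts(ours), per_class_counts(reference)
    # Counter returns 0 for missing keys, so classes absent from one side still show up.
    return {key: (a[key], b[key]) for key in sorted(set(a) | set(b)) if a[key] != b[key]}
```

An empty result means the per-class counts agree. Matching counts do not by themselves show that the same videos were used, so comparing the sets of video IDs per subset would be the stronger check.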

Circularity Check

0 steps flagged

No circularity: purely empirical evaluation with no derivations or self-referential fits.

The paper reports zero-shot VLM performance on the external WLASL300 benchmark against supervised ISLR baselines. No equations, parameter fitting, predictions derived from inputs, or load-bearing self-citations appear in the abstract or described methodology. All claims rest on direct experimental measurements and code release, making the work self-contained against external benchmarks without any reduction of results to their own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The work rests on standard ML evaluation assumptions such as benchmark representativeness and prompt validity as a capability probe. No free parameters, new axioms, or invented entities are introduced beyond those in the cited VLM literature.

[1] [1]

Flamingo: a visual language model for few- shot learning

Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob L Menick, Sebastian Borgeaud, Andy Brock, Aida Nematzadeh, Sahand Sharifzadeh, Mikoł aj Bi ´n...

work page 2022

[2] [2]

Pyav: Pythonic bindings for ffm- peg.https://github.com/PyAV-Org/PyAV,

Mike Boers. Pyav: Pythonic bindings for ffm- peg.https://github.com/PyAV-Org/PyAV,

work page

[3] [3]

Language models are few-shot learn- ers.Advances in neural information processing sys- tems, 33:1877–1901, 2020

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learn- ers.Advances in neural information processing sys- tems, 33:1877–1901, 2020. 3

work page 1901

[4] [4]

Pillow (pil fork) documentation, 2015

Alex Clark. Pillow (pil fork) documentation, 2015. 4

work page 2015

[5] [5]

Emerging Properties in Unified Multimodal Pretraining

Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, Guang Shi, and Haoqi Fan. Emerging properties in unified multimodal pretrain- ing.arXiv preprint arXiv:2505.14683, 2025. 2, 6

work page internal anchor Pith review arXiv 2025

[6] [6]

Geo-sign: Hy- perbolic contrastive regularisation for geometrically aware sign language translation

Edward Fish and Richard Bowden. Geo-sign: Hy- perbolic contrastive regularisation for geometrically aware sign language translation. InThe Thirty-ninth 7 Annual Conference on Neural Information Processing Systems, 2025. 2, 3

work page 2025

[7] [7]

Gemini 2.5: Pushing the frontier with ad- vanced reasoning, multimodality, long context, and next generation agentic capabilities, 2025

Google. Gemini 2.5: Pushing the frontier with ad- vanced reasoning, multimodality, long context, and next generation agentic capabilities, 2025. 2, 6

work page 2025

[8] [8]

Handspeak: A sign language dictio- nary online.Reference Reviews, 16(3):21–21, 2002

Ann Grafstein. Handspeak: A sign language dictio- nary online.Reference Reviews, 16(3):21–21, 2002. 3, 5

work page 2002

[9] [9]

SignMusketeers: An efficient multi-stream approach for sign language translation at scale

Shester Gueuwou, Xiaodan Du, Greg Shakhnarovich, and Karen Livescu. SignMusketeers: An efficient multi-stream approach for sign language translation at scale. InFindings of the Association for Compu- tational Linguistics: ACL 2025, pages 22506–22521, Vienna, Austria, 2025. Association for Computational Linguistics. 2

work page 2025

[10] [10]

Shester Gueuwou, Xiaodan Du, Greg Shakhnarovich, Karen Livescu, and Alexander H. Liu. SHuBERT: Self-supervised sign language representation learning via multi-stream cluster prediction. InProceedings of the 63rd Annual Meeting of the Association for Com- putational Linguistics (Volume 1: Long Papers), pages 28792–28810, Vienna, Austria, 2025. Association ...

work page 2025

[11] [11]

Acceler- ate: Training and inference at scale made simple, ef- ficient and adaptable.https://github.com/ huggingface/accelerate, 2022

Sylvain Gugger, Lysandre Debut, Thomas Wolf, Philipp Schmid, Zachary Mueller, Sourab Man- grulkar, Marc Sun, and Benjamin Bossan. Acceler- ate: Training and inference at scale made simple, ef- ficient and adaptable.https://github.com/ huggingface/accelerate, 2022. 4

work page 2022

[12] [12]

Signbert: Pre-training of hand-model-aware representation for sign language recognition

Hezhen Hu, Weichao Zhao, Wengang Zhou, Yuechen Wang, and Houqiang Li. Signbert: Pre-training of hand-model-aware representation for sign language recognition. InProceedings of the IEEE/CVF Interna- tional Conference on Computer Vision (ICCV), pages 11087–11096, 2021. 3

work page 2021

[13] [13]

Global-local enhancement network for nmf-aware sign language recognition.ACM Trans

Hezhen Hu, Wengang Zhou, Junfu Pu, and Houqiang Li. Global-local enhancement network for nmf-aware sign language recognition.ACM Trans. Multimedia Comput. Commun. Appl., 17(3), 2021. 3

work page 2021

[14]

Hezhen Hu, Weichao Zhao, Wengang Zhou, and Houqiang Li. Signbert+: Hand-model-aware self-supervised pre-training for sign language understanding. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(9):11221–11239, 2023.

[15]

Jie Huang, Wengang Zhou, Houqiang Li, and Weiping Li. Attention-based 3d-cnns for large-vocabulary sign language recognition. IEEE Transactions on Circuits and Systems for Video Technology, 29(9):2822–2832,

[16]

Shanghai AI Laboratory InternVL Team. Internvl3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency, 2025.

[17]

Youngjoon Jang, Liliane Momeni, Zifan Jiang, Joon Son Chung, Gül Varol, and Andrew Zisserman. Lost in Translation, Found in Embeddings: Sign Language Translation and Alignment. arXiv e-prints, art. arXiv:2512.08040, 2025.

[18]

Peiqi Jiao, Yuecong Min, and Xilin Chen. Visual alignment pre-training for sign language translation. In Computer Vision – ECCV 2024, pages 349–367, Cham, 2025. Springer Nature Switzerland.

[19]

Hamid Reza Vaezi Joze and Oscar Koller. Ms-asl: A large-scale data set and benchmark for understanding american sign language. arXiv preprint arXiv:1812.01053, 2018.

[20]

Brian Lester, Rami Al-Rfou, and Noah Constant. The power of scale for parameter-efficient prompt tuning. In Proceedings of the 2021 conference on empirical methods in natural language processing, pages 3045–3059, 2021.

[21]

Dongxu Li, Cristian Rodriguez, Xin Yu, and Hongdong Li. Word-level deep sign language recognition from video: A new large-scale dataset and methods comparison. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2020.

[22]

Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In Proceedings of the 40th International Conference on Machine Learning, pages 19730–19742. PMLR, 2023.

[23]

Xiang Lisa Li and Percy Liang. Prefix-tuning: Optimizing continuous prompts for generation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 4582–4597,

[24]

Zecheng Li, Wengang Zhou, Weichao Zhao, Kepeng Wu, Hezhen Hu, and Houqiang Li. Uni-sign: Toward unified sign language understanding at scale. In The Thirteenth International Conference on Learning Representations, 2025.

[25]

Han Liang, Chengyu Huang, Yuecheng Xu, Cheng Tang, Weicai Ye, Juze Zhang, Xin Chen, Jingyi Yu, and Lan Xu. Llava-slt: Visual language tuning for sign language translation. arXiv preprint arXiv:2412.16524, 2024.

[26]

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In Advances in Neural Information Processing Systems, pages 34892–34916. Curran Associates, Inc., 2023.

[27]

Liangjin Liu, Haoyang Zheng, Zhengzhong Zhu, and Pei Zhou. Skeleton-based sign language recognition using a dual-stream spatio-temporal dynamic graph convolutional network. arXiv preprint arXiv:2509.08661, 2025.

[28]

George A. Miller. WordNet: A lexical database for English. In Human Language Technology: Proceedings of a Workshop held at Plainsboro, New Jersey, March 8-11, 1994, 1994.

[29]

NVIDIA. Nvidia nemotron nano v2 vl, 2025.

[30]

OpenAI. Openai gpt-5 system card, 2025.

[31]

Qwen2.5 technical report,

Qwen, An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li,...

[32]

QwenTeam. Qwen3-vl technical report, 2025.

[33]

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.

[34]

Phillip Rust, Bowen Shi, Skyler Wang, Necati Cihan Camgöz, and Jean Maillard. Towards privacy-aware sign language translation at scale. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8624–8641, 2024.

[35]

Ozge Mercanoglu Sincan and Hacer Yalim Keles. Autsl: A large scale multi-modal turkish sign language dataset and baseline methods. IEEE Access, 8:181340–181355, 2020.

[36]

Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171, 2022.

[37]

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022.

[38]

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander Rush. Transformers: State-of-the-art natural language processing...

[39]

Ryan Wong, Necati Cihan Camgoz, and Richard Bowden. Sign2GPT: Leveraging large language models for gloss-free sign language translation. In The Twelfth International Conference on Learning Representations,

[40]

Sijie Yan, Yuanjun Xiong, and Dahua Lin. Spatial temporal graph convolutional networks for skeleton-based action recognition. Proceedings of the AAAI Conference on Artificial Intelligence, 32(1), 2018.

[41]

Biao Zhang, Garrett Tanzer, and Orhan Firat. Scaling sign language translation. In Advances in Neural Information Processing Systems, pages 114018–114047. Curran Associates, Inc., 2024.

[42]

Yuanhan Zhang, Bo Li, Haotian Liu, Yong Jae Lee, Liangke Gui, Di Fu, Jiashi Feng, Ziwei Liu, and Chunyuan Li. Llava-next: A strong zero-shot video understanding model, 2024.

[43]

Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Conditional prompt learning for vision-language models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16816–16825, 2022.

[44]

Wengang Zhou, Weichao Zhao, Hezhen Hu, Zecheng Li, and Houqiang Li. Scaling up multimodal pre-training for sign language understanding. CoRR, abs/2408.08544, 2024.