CanonSLR: Canonical-View Guided Multi-View Continuous Sign Language Recognition

Lechao Cheng; Richang Hong; Shengeng Tang; Wan Jiang; Xu Wang; Yaxiong Wang

arXiv: 2604.18184 · v1 · submitted 2026-04-20 · 💻 cs.CV

Xu Wang, Shengeng Tang, Wan Jiang, Yaxiong Wang, Lechao Cheng, Richang Hong

Pith reviewed 2026-05-10 05:06 UTC · model grok-4.3

classification 💻 cs.CV

keywords continuous sign language recognition · multi-view · teacher-student learning · knowledge distillation · temporal modeling · viewpoint robustness · sign language datasets

The pith

Frontal-view teacher supervision makes continuous sign language recognition robust across multiple camera angles.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces CanonSLR to address a common weakness: most continuous sign language recognition systems are trained only on frontal views and therefore degrade when the signer is seen from the side or at an angle. It trains a teacher network exclusively on frontal data, then uses a new sequence-level soft-target distillation step to pass that teacher's structured temporal guidance to a student network that processes all viewpoints. A temporal motion relational enhancement module emphasizes movement patterns while downplaying viewpoint-dependent appearance changes. The authors also build a pipeline that converts existing single-view sign language videos into consistent multi-view versions, and use it to create two new seven-view benchmarks. If the approach works as described, real-world sign language systems could operate reliably without requiring perfectly aligned frontal cameras.

Core claim

The central claim is that a frontal-view-anchored teacher-student architecture, using sequence-level soft-target distillation to transfer canonical temporal knowledge and temporal motion relational enhancement to model stable dynamics, reduces cross-view semantic discrepancies and produces more accurate and robust gloss sequences on non-frontal inputs than prior single-view or multi-view methods.
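To make the distillation idea concrete, here is a minimal sketch of what a sequence-level soft-target loss could look like. It is not the paper's formulation, which this page does not reproduce. It assumes the standard temperature-softened knowledge-distillation recipe applied at every time step: the teacher sees the frontal clip, the student sees another view of the same clip, and both emit one vector of gloss logits per frame. The names `log_softmax` and `sequence_kd_loss`, the temperature value, and the toy logits are all illustrative assumptions.

```python
import math

def log_softmax(logits, temperature=1.0):
    """Temperature-softened log-probabilities for one frame's gloss logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in scaled))
    return [x - log_norm for x in scaled]

def sequence_kd_loss(teacher_logits, student_logits, temperature=2.0):
    """Mean per-frame KL(teacher || student) over a whole sequence.

    Both arguments are lists of T frames, each a list of V gloss logits.
    Assumption: teacher and student outputs are already frame-aligned.
    """
    if not teacher_logits or len(teacher_logits) != len(student_logits):
        raise ValueError("sequences must be non-empty and of equal length")
    total = 0.0
    for t_frame, s_frame in zip(teacher_logits, student_logits):
        log_p = log_softmax(t_frame, temperature)  # soft target (frontal teacher)
        log_q = log_softmax(s_frame, temperature)  # student on another view
        total += sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
    # T^2 scaling keeps gradient magnitudes comparable across temperatures
    return (temperature ** 2) * total / len(teacher_logits)

# Toy example: 2 frames, 3 gloss classes (values are made up).
teacher = [[2.0, 0.5, -1.0], [0.1, 3.0, 0.2]]
student_matching = [[2.0, 0.5, -1.0], [0.1, 3.0, 0.2]]
student_uniform = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
loss_matching = sequence_kd_loss(teacher, student_matching)
loss_uniform = sequence_kd_loss(teacher, student_uniform)
```

In a full system this term would typically be added to the usual recognition loss with a weighting factor; how the paper combines the two is not stated here.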

What carries the argument

The frontal-view-anchored teacher-student learning strategy that supplies canonical temporal supervision from a frontal teacher to a multi-view student.

If this is right

The universal multi-view data construction pipeline can turn any single-view sign language corpus into semantically consistent, temporally aligned seven-view data.
Sequence-level soft-target distillation transfers structured temporal knowledge and thereby reduces gloss boundary ambiguity on occluded or projected views.
Temporal motion relational enhancement produces viewpoint-stable dynamic features by operating on high-level motion relations rather than raw appearance.
CanonSLR reports higher recognition accuracy than prior methods on the new PT14-MV and CSL-MV benchmarks, with the largest gains on the most challenging non-frontal angles.
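As a rough illustration of why modeling motion relations rather than raw appearance can help, the sketch below takes frame-to-frame feature differences as motion descriptors and compares them pairwise with cosine similarity. This is only one plausible reading of "temporal motion relational enhancement"; the paper's actual module is not described on this page, and every function name and value here is hypothetical. The point it shows is narrow: a static appearance offset applied to every frame cancels out in the differences, so the relation matrix is unchanged.

```python
import math

def motion_features(frames):
    """First-order temporal differences between consecutive frame features."""
    return [[b - a for a, b in zip(prev, cur)]
            for prev, cur in zip(frames, frames[1:])]

def cosine(u, v):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0.0 else 0.0

def motion_relation_matrix(frames):
    """Pairwise cosine similarity between motion vectors, shape (T-1) x (T-1)."""
    motion = motion_features(frames)
    return [[cosine(u, v) for v in motion] for u in motion]

# Toy 2-D features for 4 frames (made-up values).
frames = [[0.0, 0.0], [1.0, 0.0], [1.0, 2.0], [0.0, 2.0]]
# The same clip with a constant appearance offset added to every frame.
shifted = [[x + 5.0, y - 3.0] for x, y in frames]

relation = motion_relation_matrix(frames)
relation_shifted = motion_relation_matrix(shifted)
```

Real viewpoint changes are not constant offsets, so this invariance is a best case; it illustrates the motivation, not a guarantee.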

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same teacher-student structure could be tested on other viewpoint-sensitive video tasks such as gesture or action recognition where only one canonical view is easy to obtain during training.
If the synthetic data pipeline introduces subtle timing shifts, real simultaneous multi-camera recordings would be needed to isolate whether the reported robustness truly comes from the distillation or from data artifacts.
Combining the canonical-view guidance with skeleton or depth inputs might further suppress appearance disturbances while preserving the temporal supervision mechanism.

Load-bearing premise

The frontal view supplies unbiased canonical temporal supervision that transfers cleanly to non-frontal views without artifacts introduced by the synthetic multi-view video generation pipeline.

What would settle it

Training and testing the same architecture on a dataset of real, simultaneously captured multi-view sign language videos and checking whether non-frontal accuracy still exceeds single-view baselines or whether performance drops due to synthetic-data mismatches.
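The comparison described above comes down to reporting an error rate separately for each camera view. A minimal sketch follows, assuming word error rate over gloss sequences (the usual CSLR metric; this page only says "recognition accuracy"). The view labels and gloss tokens are invented for illustration.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two gloss sequences (lists of strings)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution or match
        prev = cur
    return prev[-1]

def wer_by_view(samples):
    """Corpus-level WER per view from (view, reference, hypothesis) triples."""
    errors, lengths = {}, {}
    for view, ref, hyp in samples:
        errors[view] = errors.get(view, 0) + edit_distance(ref, hyp)
        lengths[view] = lengths.get(view, 0) + len(ref)
    return {v: errors[v] / lengths[v] for v in errors if lengths[v] > 0}

# Hypothetical predictions for one sentence seen from two views.
reference = ["MORGEN", "REGEN", "NORD"]
samples = [
    ("frontal", reference, ["MORGEN", "REGEN", "NORD"]),
    ("side", reference, ["MORGEN", "NORD"]),  # one gloss dropped
]
wer = wer_by_view(samples)
```

Running the same breakdown on real multi-camera recordings and on the synthetic views would show whether the non-frontal gap closes in both settings or only on synthesized data.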

Figures

Figures reproduced from arXiv: 2604.18184 by Lechao Cheng, Richang Hong, Shengeng Tang, Wan Jiang, Xu Wang, Yaxiong Wang.

**Figure 2.** Multi-view data curation pipeline. (a) Whole-body …

**Figure 3.** Overview of the proposed CanonSLR, a canonical-view guided framework for multi-view CSLR. In Stage I, a frontal …

**Figure 4.** Detailed analysis of the designed CanonSLR on the PT14-MV dataset, where (a), (c), (d) report the performance on …

**Figure 5.** Visualization of the same sign sample under multi …

Abstract

Continuous Sign Language Recognition (CSLR) has achieved remarkable progress in recent years; however, most existing methods are developed under single-view settings and thus remain insufficiently robust to viewpoint variations in real-world scenarios. To address this limitation, we propose CanonSLR, a canonical-view guided framework for multi-view CSLR. Specifically, we introduce a frontal-view-anchored teacher-student learning strategy, in which a teacher network trained on frontal-view data provides canonical temporal supervision for a student network trained on all viewpoints. To further reduce cross-view semantic discrepancy, we propose Sequence-Level Soft-Target Distillation, which transfers structured temporal knowledge from the frontal view to non-frontal samples, thereby alleviating gloss boundary ambiguity and category confusion caused by occlusion and projection variation. In addition, we introduce Temporal Motion Relational Enhancement to explicitly model motion-aware temporal relations in high-level visual features, strengthening stable dynamic representations while suppressing viewpoint-sensitive appearance disturbances. To support multi-view CSLR research, we further develop a universal multi-view sign language data construction pipeline that transforms original single-view RGB videos into semantically consistent, temporally coherent, and viewpoint-controllable multi-view sign language videos. Based on this pipeline, we extend PHOENIX-2014T and CSL-Daily into two seven-view benchmarks, namely PT14-MV and CSL-MV, providing a new experimental foundation for multi-view CSLR. Extensive experiments on PT14-MV and CSL-MV demonstrate that CanonSLR consistently outperforms existing approaches under multi-view settings and exhibits stronger robustness, especially on challenging non-frontal views.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

CanonSLR adds a frontal teacher-student distillation setup and a synthetic multi-view data pipeline to CSLR, but the robustness claims rest on data that may contain exploitable artifacts.

CanonSLR's main move is training a teacher only on frontal views and using sequence-level soft-target distillation plus temporal motion relational enhancement to supervise a student on all views. The authors also build a pipeline that turns single-view videos into seven-view sequences and use it to extend PHOENIX-2014T and CSL-Daily into the PT14-MV and CSL-MV benchmarks. That combination is new for this subfield and directly targets the practical problem that most CSLR work ignores viewpoint change. The benchmarks themselves are a concrete service to anyone who wants to test multi-view methods.

The abstract reports gains on non-frontal views, which is the expected direction if the approach works. The distillation step and the motion modeling are standard tools applied sensibly here, and the paper cites the relevant single-view baselines.

The soft spot is the data pipeline. The view-synthesis step (projection, occlusion, interpolation) is deterministic across the dataset, so consistent appearance or boundary shifts could let the student learn to invert those specific artifacts rather than acquire viewpoint-invariant temporal features. If that happens, the reported robustness on challenging views would be overstated. The abstract gives no ablation that swaps in real multi-view footage or tests alternative synthesis methods, and no statistical details on the gains. That leaves the central claim vulnerable.

This paper is for people already working on continuous sign language recognition who need multi-view testbeds and a simple way to adapt existing single-view pipelines. A reader focused on accessibility or video understanding will find the datasets useful even if they skip the method. It should go to peer review because the idea is straightforward, the data contribution is real, and the experiments can be checked and strengthened in revision.

Referee Report

2 major / 2 minor

Summary. The paper proposes CanonSLR, a teacher-student framework for multi-view continuous sign language recognition (CSLR). A teacher network trained exclusively on frontal-view data supplies canonical temporal supervision to a student network via Sequence-Level Soft-Target Distillation; an additional Temporal Motion Relational Enhancement module models motion-aware relations in high-level features. The authors also present a universal pipeline that synthetically transforms single-view RGB videos into semantically consistent seven-view sequences, yielding new benchmarks PT14-MV and CSL-MV derived from PHOENIX-2014T and CSL-Daily. Experiments on these benchmarks claim that CanonSLR outperforms prior methods, with particular gains in robustness on non-frontal views.

Significance. If the central claims hold, the work would meaningfully advance multi-view CSLR by providing both a practical distillation-based training strategy and the first publicly extensible multi-view benchmarks for the task. The new datasets address a clear gap in existing single-view CSLR research and could serve as a foundation for future viewpoint-robust methods. The combination of frontal-anchored supervision with explicit motion modeling is a reasonable engineering response to occlusion and projection challenges.

major comments (2)

[Data Construction Pipeline] The manuscript describes the universal pipeline that generates PT14-MV and CSL-MV but provides no quantitative validation (e.g., human semantic-consistency ratings, gloss-level alignment metrics, or comparison against real multi-view captures) that the synthetic non-frontal views preserve semantics without introducing systematic, dataset-wide artifacts (projection shifts, occlusion patterns, or temporal interpolation biases). Because the teacher is trained only on the original frontal data and the student receives Sequence-Level Soft-Target Distillation on the transformed views, any consistent synthesis artifact could be exploited by the student, directly threatening the claim that observed robustness gains reflect true viewpoint invariance rather than artifact inversion.
[§3.2, Sequence-Level Soft-Target Distillation, and experimental tables] The distillation loss is applied across views without an explicit mechanism to correct for potential temporal misalignment introduced by the view-synthesis step. An ablation that isolates the contribution of the distillation term versus the Temporal Motion Relational Enhancement module on non-frontal test performance is needed to confirm that the reported gains are not driven by the student simply learning to compensate for pipeline-specific temporal shifts.

minor comments (2)

[Abstract] The abstract states that the pipeline produces 'semantically consistent, temporally coherent' videos; a brief sentence quantifying this consistency (e.g., via inter-view gloss overlap or optical-flow coherence scores) would improve clarity.
[Figures] Figure captions and legends should explicitly label which curves correspond to the proposed CanonSLR variants versus the re-implemented baselines to facilitate direct comparison of non-frontal performance.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We are grateful to the referee for the constructive and detailed review, as well as the positive assessment of the significance of our work. We have carefully addressed each major comment below with honest responses and commit to revisions that strengthen the manuscript without misrepresenting our contributions.

Referee: [Data Construction Pipeline] The manuscript describes the universal pipeline that generates PT14-MV and CSL-MV but provides no quantitative validation (e.g., human semantic-consistency ratings, gloss-level alignment metrics, or comparison against real multi-view captures) that the synthetic non-frontal views preserve semantics without introducing systematic, dataset-wide artifacts (projection shifts, occlusion patterns, or temporal interpolation biases). Because the teacher is trained only on the original frontal data and the student receives Sequence-Level Soft-Target Distillation on the transformed views, any consistent synthesis artifact could be exploited by the student, directly threatening the claim that observed robustness gains reflect true viewpoint invariance rather than artifact inversion.

Authors: We thank the referee for highlighting this critical point. We agree that quantitative validation of semantic preservation in the synthetic pipeline is necessary to fully support our claims. In the revised manuscript, we will add a dedicated analysis section (or appendix) that includes: (1) human evaluation results with semantic-consistency ratings from multiple annotators on a sampled subset of sequences; (2) gloss-level alignment metrics (e.g., via dynamic time warping on gloss sequences) to measure temporal and semantic fidelity; and (3) a discussion of potential artifacts with supporting evidence from our experiments. Regarding the risk of artifact exploitation, the Sequence-Level Soft-Target Distillation explicitly transfers temporal supervision from the frontal teacher (trained on artifact-free data), which encourages the student to learn invariant representations rather than inverting synthesis-specific patterns. The consistent gains on non-frontal views across different synthesis settings further indicate true robustness. We will clarify this reasoning and include the new validations.
revision: yes

Referee: [§3.2, Sequence-Level Soft-Target Distillation, and experimental tables] The distillation loss is applied across views without an explicit mechanism to correct for potential temporal misalignment introduced by the view-synthesis step. An ablation that isolates the contribution of the distillation term versus the Temporal Motion Relational Enhancement module on non-frontal test performance is needed to confirm that the reported gains are not driven by the student simply learning to compensate for pipeline-specific temporal shifts.

Authors: We appreciate this suggestion for a more targeted ablation. While the manuscript already includes component ablations, we agree that isolating the distillation term's impact specifically on non-frontal performance is valuable. In the revision, we will expand the experimental section with new ablation tables that separately evaluate Sequence-Level Soft-Target Distillation and Temporal Motion Relational Enhancement on non-frontal test sets, demonstrating their individual and combined contributions. On temporal misalignment, the data construction pipeline preserves frame-to-frame coherence via consistent 3D motion reconstruction and rendering, and the sequence-level soft targets are inherently robust to small shifts by emphasizing overall temporal structure rather than precise frame alignment. We will add explicit discussion of this design choice and the requested ablation results to confirm the gains reflect viewpoint invariance.
revision: yes
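
The gloss-level alignment check mentioned in the first response (dynamic time warping on gloss sequences) could be sketched roughly as follows. This is an assumption about how such a check might be set up, not the authors' procedure: it compares frame-level gloss labels for the original frontal clip against labels for a synthesized view, using a 0/1 local cost. The helper names and the toy label sequences are hypothetical. A plain frame-by-frame mismatch count is included alongside because DTW deliberately absorbs small timing shifts, which is exactly the kind of artifact the referee is asking about.

```python
def dtw_cost(seq_a, seq_b):
    """DTW alignment cost between two frame-level label sequences (0/1 local cost)."""
    inf = float("inf")
    n, m = len(seq_a), len(seq_b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = 0.0 if seq_a[i - 1] == seq_b[j - 1] else 1.0
            cost[i][j] = local + min(cost[i - 1][j],      # seq_a advances
                                     cost[i][j - 1],      # seq_b advances
                                     cost[i - 1][j - 1])  # both advance
    return cost[n][m]

def frame_mismatches(seq_a, seq_b):
    """Number of positions where two equal-length label sequences disagree."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have equal length")
    return sum(a != b for a, b in zip(seq_a, seq_b))

# Hypothetical frame-level gloss labels for one clip.
frontal = ["A", "A", "B", "B", "C"]
side_shifted = ["A", "B", "B", "B", "C"]  # gloss boundary moved by one frame
side_swapped = ["A", "C", "C", "B", "B"]  # gloss order changed

shift_dtw = dtw_cost(frontal, side_shifted)
shift_frames = frame_mismatches(frontal, side_shifted)
swap_dtw = dtw_cost(frontal, side_swapped)
```

Here the boundary shift yields a DTW cost of zero but one frame-level mismatch, so reporting both numbers would separate semantic agreement from timing drift.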

Circularity Check

0 steps flagged

No significant circularity detected in derivation or claims

The paper presents an empirical framework applying teacher-student distillation and temporal feature enhancement to multi-view CSLR, supported by a new synthetic data pipeline for creating PT14-MV and CSL-MV benchmarks. Performance claims rest on experimental results rather than any closed-form derivation, fitted parameter renamed as prediction, or self-referential definition. No equations reduce claimed robustness gains to inputs by construction, and no load-bearing self-citations or ansatzes are invoked to justify core components. The approach extends standard techniques to a new setting without circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claims rest on standard assumptions in knowledge distillation and video feature learning; no explicit free parameters, axioms, or invented entities are detailed in the abstract.
