arxiv: 2604.18184 · v1 · submitted 2026-04-20 · 💻 cs.CV

CanonSLR: Canonical-View Guided Multi-View Continuous Sign Language Recognition

Pith reviewed 2026-05-10 05:06 UTC · model grok-4.3

classification 💻 cs.CV
keywords continuous sign language recognition · multi-view · teacher-student learning · knowledge distillation · temporal modeling · viewpoint robustness · sign language datasets

The pith

Frontal-view teacher supervision makes continuous sign language recognition robust across multiple camera angles.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces CanonSLR to address the fact that most continuous sign language recognition systems are trained only on frontal views and therefore degrade when the signer is seen from the side or at an angle. It trains a teacher network exclusively on frontal data and, through a new sequence-level soft-target distillation step, uses that teacher to supply structured temporal guidance to a student network that processes all viewpoints. A temporal motion relational enhancement module emphasizes movement patterns while downplaying viewpoint-dependent appearance changes. The authors also develop a pipeline that converts existing single-view sign language videos into consistent multi-view versions, and use it to build two new seven-view benchmarks. If the approach works as described, real-world sign language systems could operate reliably without requiring perfectly aligned frontal cameras.

Core claim

The central claim is that a frontal-view-anchored teacher-student architecture, using sequence-level soft-target distillation to transfer canonical temporal knowledge and temporal motion relational enhancement to model stable dynamics, reduces cross-view semantic discrepancies and produces more accurate and robust gloss sequences on non-frontal inputs than prior single-view or multi-view methods.

What carries the argument

The frontal-view-anchored teacher-student learning strategy that supplies canonical temporal supervision from a frontal teacher to a multi-view student.
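
The teacher-to-student supervision described above can be illustrated with a generic soft-target loss. This is a minimal sketch, not the paper's formulation: it assumes the teacher and student each emit per-frame logits over the gloss vocabulary for temporally aligned frames, and it uses a temperature-softened KL divergence averaged over the sequence. The function names, the temperature value, and the toy logits are all hypothetical.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over one frame's gloss logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sequence_soft_target_loss(teacher_logits, student_logits, temperature=2.0):
    """Mean per-frame KL(teacher || student) over an aligned sequence.

    Both arguments are lists of T frames, each a list of V gloss logits.
    Assumption: frame t of the teacher corresponds to frame t of the student.
    """
    if not teacher_logits or len(teacher_logits) != len(student_logits):
        raise ValueError("sequences must be non-empty and of equal length")
    total = 0.0
    for t_frame, s_frame in zip(teacher_logits, student_logits):
        p = softmax(t_frame, temperature)  # soft targets from the frontal teacher
        q = softmax(s_frame, temperature)  # student prediction on some view
        total += sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)
    return total / len(teacher_logits)

# Toy example: 3 frames, a vocabulary of 4 glosses (values are made up).
teacher = [[4.0, 0.5, 0.1, 0.1], [0.2, 3.5, 0.3, 0.1], [0.1, 0.2, 0.3, 3.0]]
student_good = [[3.0, 0.6, 0.2, 0.1], [0.3, 3.0, 0.2, 0.2], [0.2, 0.1, 0.4, 2.5]]
student_bad = [[0.1, 0.2, 3.0, 0.1], [3.0, 0.2, 0.1, 0.3], [0.2, 3.0, 0.1, 0.1]]

loss_good = sequence_soft_target_loss(teacher, student_good)
loss_bad = sequence_soft_target_loss(teacher, student_bad)
```

A student whose per-frame distributions track the teacher's gets a lower loss than one that confuses glosses. The per-frame alignment assumption is where timing shifts from synthetic view generation would matter.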

If this is right

  • The universal multi-view data construction pipeline can turn any single-view sign language corpus into semantically consistent, temporally aligned seven-view data.
  • Sequence-level soft-target distillation transfers structured temporal knowledge and thereby reduces gloss boundary ambiguity on occluded or projected views.
  • Temporal motion relational enhancement produces viewpoint-stable dynamic features by operating on high-level motion relations rather than raw appearance.
  • CanonSLR reports higher recognition accuracy than prior methods on the new PT14-MV and CSL-MV benchmarks, with the largest gains on the most challenging non-frontal angles.
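
The idea of operating on motion relations rather than raw appearance can be made concrete with a toy example. The sketch below is one possible reading, not the paper's module: it treats frame-to-frame feature differences as "motion" and the cosine similarity between those differences as "relations". The helper names and the feature values are hypothetical, and a constant additive offset is only a crude stand-in for a viewpoint-dependent appearance change.

```python
import math

def motion_features(frames):
    """Frame-to-frame differences of a feature sequence (T x D -> (T-1) x D)."""
    return [[b - a for a, b in zip(prev, cur)]
            for prev, cur in zip(frames, frames[1:])]

def cosine(u, v):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(u, v)) / (nu * nv)

def motion_relation_matrix(frames):
    """Pairwise cosine similarity between all motion (difference) vectors."""
    motion = motion_features(frames)
    return [[cosine(a, b) for b in motion] for a in motion]

# Four frames of 3-dimensional features (made-up values).
frames = [[0.0, 1.0, 0.0], [1.0, 1.0, 0.0], [2.0, 1.0, 0.0], [2.0, 0.0, 0.0]]

# The same sequence with a constant offset added to every frame.
offset = [0.5, -0.25, 2.0]
shifted = [[x + o for x, o in zip(frame, offset)] for frame in frames]

relations = motion_relation_matrix(frames)
relations_shifted = motion_relation_matrix(shifted)
```

Because the offset cancels in the differences, `relations` and `relations_shifted` agree. Real viewpoint changes are not additive shifts, so this shows only why difference-based relations can discard static appearance, not that they are viewpoint-invariant.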

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same teacher-student structure could be tested on other viewpoint-sensitive video tasks such as gesture or action recognition where only one canonical view is easy to obtain during training.
  • If the synthetic data pipeline introduces subtle timing shifts, real simultaneous multi-camera recordings would be needed to isolate whether the reported robustness truly comes from the distillation or from data artifacts.
  • Combining the canonical-view guidance with skeleton or depth inputs might further suppress appearance disturbances while preserving the temporal supervision mechanism.

Load-bearing premise

The frontal view supplies unbiased canonical temporal supervision that transfers cleanly to non-frontal views without artifacts introduced by the synthetic multi-view video generation pipeline.

What would settle it

Training and testing the same architecture on a dataset of real, simultaneously captured multi-view sign language videos and checking whether non-frontal accuracy still exceeds single-view baselines or whether performance drops due to synthetic-data mismatches.
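
The comparison proposed above comes down to scoring recognition quality separately for each camera view. The sketch below assumes word error rate (WER) over gloss sequences, a metric commonly used for CSLR; this page does not state which metric the paper reports. The view names, gloss strings, and function names are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two gloss sequences (lists of strings)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution or match
        prev = cur
    return prev[-1]

def wer_by_view(samples):
    """Aggregate WER per view.

    samples: iterable of (view, reference_glosses, hypothesis_glosses).
    Returns {view: total_edit_distance / total_reference_length}.
    """
    errors, lengths = {}, {}
    for view, ref, hyp in samples:
        errors[view] = errors.get(view, 0) + edit_distance(ref, hyp)
        lengths[view] = lengths.get(view, 0) + len(ref)
    return {v: errors[v] / lengths[v] for v in errors if lengths[v] > 0}

# Made-up predictions for two views.
samples = [
    ("frontal", ["MORGEN", "REGEN", "NORD"], ["MORGEN", "REGEN", "NORD"]),
    ("side", ["MORGEN", "REGEN", "NORD"], ["MORGEN", "NORD"]),
    ("side", ["HEUTE", "SONNE"], ["HEUTE", "WOLKE"]),
]
scores = wer_by_view(samples)
```

Reporting the score per view rather than pooled keeps a strong frontal result from masking weak non-frontal ones, which is the contrast this test is meant to expose.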

Figures

Figures reproduced from arXiv: 2604.18184 by Lechao Cheng, Richang Hong, Shengeng Tang, Wan Jiang, Xu Wang, Yaxiong Wang.

Figure 1: Multi-view data curation across seven viewpoints: …
Figure 2: Multi-view data curation pipeline. (a) Whole-body …
Figure 3: Overview of the proposed CanonSLR, a canonical-view guided framework for multi-view CSLR. In Stage I, a frontal …
Figure 4: Detailed analysis of the designed CanonSLR on the PT14-MV dataset, where (a), (c), (d) report the performance on …
Figure 5: Visualization of the same sign sample under multi …
Original abstract

Continuous Sign Language Recognition (CSLR) has achieved remarkable progress in recent years; however, most existing methods are developed under single-view settings and thus remain insufficiently robust to viewpoint variations in real-world scenarios. To address this limitation, we propose CanonSLR, a canonical-view guided framework for multi-view CSLR. Specifically, we introduce a frontal-view-anchored teacher-student learning strategy, in which a teacher network trained on frontal-view data provides canonical temporal supervision for a student network trained on all viewpoints. To further reduce cross-view semantic discrepancy, we propose Sequence-Level Soft-Target Distillation, which transfers structured temporal knowledge from the frontal view to non-frontal samples, thereby alleviating gloss boundary ambiguity and category confusion caused by occlusion and projection variation. In addition, we introduce Temporal Motion Relational Enhancement to explicitly model motion-aware temporal relations in high-level visual features, strengthening stable dynamic representations while suppressing viewpoint-sensitive appearance disturbances. To support multi-view CSLR research, we further develop a universal multi-view sign language data construction pipeline that transforms original single-view RGB videos into semantically consistent, temporally coherent, and viewpoint-controllable multi-view sign language videos. Based on this pipeline, we extend PHOENIX-2014T and CSL-Daily into two seven-view benchmarks, namely PT14-MV and CSL-MV, providing a new experimental foundation for multi-view CSLR. Extensive experiments on PT14-MV and CSL-MV demonstrate that CanonSLR consistently outperforms existing approaches under multi-view settings and exhibits stronger robustness, especially on challenging non-frontal views.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes CanonSLR, a teacher-student framework for multi-view continuous sign language recognition (CSLR). A teacher network trained exclusively on frontal-view data supplies canonical temporal supervision to a student network via Sequence-Level Soft-Target Distillation; an additional Temporal Motion Relational Enhancement module models motion-aware relations in high-level features. The authors also present a universal pipeline that synthetically transforms single-view RGB videos into semantically consistent seven-view sequences, yielding new benchmarks PT14-MV and CSL-MV derived from PHOENIX-2014T and CSL-Daily. On these benchmarks, the authors report that CanonSLR outperforms prior methods, with particular gains in robustness on non-frontal views.

Significance. If the central claims hold, the work would meaningfully advance multi-view CSLR by providing both a practical distillation-based training strategy and the first publicly extensible multi-view benchmarks for the task. The new datasets address a clear gap in existing single-view CSLR research and could serve as a foundation for future viewpoint-robust methods. The combination of frontal-anchored supervision with explicit motion modeling is a reasonable engineering response to occlusion and projection challenges.

major comments (2)
  1. [Data Construction Pipeline] Data Construction Pipeline section: The manuscript describes the universal pipeline that generates PT14-MV and CSL-MV but provides no quantitative validation (e.g., human semantic-consistency ratings, gloss-level alignment metrics, or comparison against real multi-view captures) that the synthetic non-frontal views preserve semantics without introducing systematic, dataset-wide artifacts (projection shifts, occlusion patterns, or temporal interpolation biases). Because the teacher is trained only on the original frontal data and the student receives Sequence-Level Soft-Target Distillation on the transformed views, any consistent synthesis artifact could be exploited by the student, directly threatening the claim that observed robustness gains reflect true viewpoint invariance rather than artifact inversion.
  2. [§3.2] §3.2 (Sequence-Level Soft-Target Distillation) and experimental tables: The distillation loss is applied across views without an explicit mechanism to correct for potential temporal misalignment introduced by the view-synthesis step. An ablation that isolates the contribution of the distillation term versus the Temporal Motion Relational Enhancement module on non-frontal test performance is needed to confirm that the reported gains are not driven by the student simply learning to compensate for pipeline-specific temporal shifts.
minor comments (2)
  1. [Abstract] The abstract states that the pipeline produces 'semantically consistent, temporally coherent' videos; a brief sentence quantifying this consistency (e.g., via inter-view gloss overlap or optical-flow coherence scores) would improve clarity.
  2. [Figures] Figure captions and legends should explicitly label which curves correspond to the proposed CanonSLR variants versus the re-implemented baselines to facilitate direct comparison of non-frontal performance.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We are grateful to the referee for the constructive and detailed review, as well as the positive assessment of the significance of our work. We address each major comment below and commit to revisions that strengthen the manuscript without overstating our contributions.

  1. Referee: [Data Construction Pipeline] Data Construction Pipeline section: The manuscript describes the universal pipeline that generates PT14-MV and CSL-MV but provides no quantitative validation (e.g., human semantic-consistency ratings, gloss-level alignment metrics, or comparison against real multi-view captures) that the synthetic non-frontal views preserve semantics without introducing systematic, dataset-wide artifacts (projection shifts, occlusion patterns, or temporal interpolation biases). Because the teacher is trained only on the original frontal data and the student receives Sequence-Level Soft-Target Distillation on the transformed views, any consistent synthesis artifact could be exploited by the student, directly threatening the claim that observed robustness gains reflect true viewpoint invariance rather than artifact inversion.

    Authors: We thank the referee for highlighting this critical point. We agree that quantitative validation of semantic preservation in the synthetic pipeline is necessary to fully support our claims. In the revised manuscript, we will add a dedicated analysis section (or appendix) that includes: (1) human evaluation results with semantic-consistency ratings from multiple annotators on a sampled subset of sequences; (2) gloss-level alignment metrics (e.g., via dynamic time warping on gloss sequences) to measure temporal and semantic fidelity; and (3) a discussion of potential artifacts with supporting evidence from our experiments. Regarding the risk of artifact exploitation, the Sequence-Level Soft-Target Distillation explicitly transfers temporal supervision from the frontal teacher (trained on artifact-free data), which encourages the student to learn invariant representations rather than inverting synthesis-specific patterns. The consistent gains on non-frontal views across different synthesis settings further indicate true robustness. We will clarify this reasoning and include the new validations.
    revision: yes

  2. Referee: [§3.2] §3.2 (Sequence-Level Soft-Target Distillation) and experimental tables: The distillation loss is applied across views without an explicit mechanism to correct for potential temporal misalignment introduced by the view-synthesis step. An ablation that isolates the contribution of the distillation term versus the Temporal Motion Relational Enhancement module on non-frontal test performance is needed to confirm that the reported gains are not driven by the student simply learning to compensate for pipeline-specific temporal shifts.

    Authors: We appreciate this suggestion for a more targeted ablation. While the manuscript already includes component ablations, we agree that isolating the distillation term's impact specifically on non-frontal performance is valuable. In the revision, we will expand the experimental section with new ablation tables that separately evaluate Sequence-Level Soft-Target Distillation and Temporal Motion Relational Enhancement on non-frontal test sets, demonstrating their individual and combined contributions. On temporal misalignment, the data construction pipeline preserves frame-to-frame coherence via consistent 3D motion reconstruction and rendering, and the sequence-level soft targets are inherently robust to small shifts by emphasizing overall temporal structure rather than precise frame alignment. We will add explicit discussion of this design choice and the requested ablation results to confirm the gains reflect viewpoint invariance.
    revision: yes
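
The gloss-level alignment check mentioned in the first response (dynamic time warping on gloss sequences) could look roughly like the following. This is a sketch under assumptions: it compares frame-level gloss label sequences from two views using a 0/1 local cost, which is only one way such a metric might be defined. The label sequences and function name are hypothetical.

```python
def dtw_cost(seq_a, seq_b):
    """Dynamic time warping cost between two label sequences.

    Local cost is 0.0 when the labels match and 1.0 otherwise; the usual
    three DTW moves (step in a, step in b, step in both) are allowed.
    """
    if not seq_a or not seq_b:
        raise ValueError("both sequences must be non-empty")
    inf = float("inf")
    n, m = len(seq_a), len(seq_b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = 0.0 if seq_a[i - 1] == seq_b[j - 1] else 1.0
            cost[i][j] = local + min(cost[i - 1][j],      # advance seq_a only
                                     cost[i][j - 1],      # advance seq_b only
                                     cost[i - 1][j - 1])  # advance both
    return cost[n][m]

# Made-up frame-level gloss labels for one clip seen from different views.
frontal = ["A", "A", "B", "B", "B", "C"]
shifted = ["A", "A", "A", "B", "B", "C"]    # same order, boundary moved one frame
different = ["A", "A", "C", "C", "C", "B"]  # gloss order changed

cost_shifted = dtw_cost(frontal, shifted)
cost_different = dtw_cost(frontal, different)
```

The boundary-shifted sequence has zero cost, because warping absorbs timing differences, while the reordered sequence does not. A low DTW cost therefore supports semantic consistency but cannot rule out the temporal shifts raised in the second comment; measuring those would need something like the warping path's deviation from the diagonal.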

Circularity Check

0 steps flagged

No significant circularity detected in derivation or claims

The paper presents an empirical framework applying teacher-student distillation and temporal feature enhancement to multi-view CSLR, supported by a new synthetic data pipeline for creating PT14-MV and CSL-MV benchmarks. Performance claims rest on experimental results rather than any closed-form derivation, fitted parameter renamed as prediction, or self-referential definition. No equations reduce claimed robustness gains to inputs by construction, and no load-bearing self-citations or ansatzes are invoked to justify core components. The approach extends standard techniques to a new setting without circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claims rest on standard assumptions in knowledge distillation and video feature learning; no explicit free parameters, axioms, or invented entities are detailed in the abstract.
