EfficientSign: An Attention-Enhanced Lightweight Architecture for Indian Sign Language Recognition

Rishabh Gupta; Shravya R. Nalla

arxiv: 2604.08694 · v1 · submitted 2026-04-09 · 💻 cs.CV · cs.LG


Pith reviewed 2026-05-10 18:12 UTC · model grok-4.3

keywords Indian Sign Language · Sign Language Recognition · Lightweight Model · Attention Mechanism · EfficientNet · Mobile Vision · Computer Vision · Gesture Recognition

The pith

EfficientSign reaches 99.94 percent accuracy on Indian Sign Language alphabets using 62 percent fewer parameters than ResNet18.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper constructs a lightweight image classifier called EfficientSign by taking the EfficientNet-B0 backbone and adding two attention modules: one that recalibrates channel importance and another that emphasizes spatial hand regions. On a dataset of 12,637 static images covering all 26 ISL alphabet classes, the model is evaluated with 5-fold cross-validation and reaches 99.94 percent accuracy. That is within 0.03 percentage points of the much larger ResNet18 (99.97 percent), while the parameter count drops from 11.2 million to 4.2 million. The authors also show that 1,280-dimensional features taken from EfficientNet-B0's pooling layer support classical classifiers such as SVM at 99.63 percent accuracy, well above the 92 percent reported for earlier SURF-based pipelines.

Core claim

By adding Squeeze-and-Excitation channel attention and a dedicated spatial attention layer to EfficientNet-B0, the resulting EfficientSign network reaches 99.94 percent accuracy on 12,637 Indian Sign Language alphabet images, effectively matching the 99.97 percent of ResNet18 while using only 4.2 million parameters instead of 11.2 million. Deep features from EfficientNet-B0's pooling layer allow SVM and logistic regression classifiers to exceed 99 percent accuracy (99.63 and 99.03 percent), with KNN at 96.33 percent.
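The 62 percent figure follows from the two parameter counts quoted in the abstract. A quick check, using only those rounded counts (so the result is itself approximate):

```python
# Parameter counts as quoted in the abstract (rounded, in millions).
resnet18_params_m = 11.2
efficientsign_params_m = 4.2

# Relative reduction = (baseline - model) / baseline.
reduction = (resnet18_params_m - efficientsign_params_m) / resnet18_params_m
print(f"parameter reduction: {reduction:.1%}")  # prints "parameter reduction: 62.5%"

# Accuracy gap between the two models, in percentage points.
gap_pp = 99.97 - 99.94
print(f"accuracy gap: {gap_pp:.2f} percentage points")
```

The rounded inputs give 62.5 percent, which the paper reports as 62 percent.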

What carries the argument

The attention-enhanced EfficientNet-B0 that combines Squeeze-and-Excitation blocks for channel reweighting with an added spatial attention layer to highlight hand-gesture locations in each image.
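To make the two attention steps concrete, here is a minimal pure-Python sketch on a toy feature map. It is an illustration under stated assumptions, not the authors' implementation: the function names, the toy weights, and the simplified spatial gate (a 1x1 mix of channel-wise mean and max maps in place of a learned convolution) are all hypothetical, since the page does not specify the layer details.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def se_channel_attention(fmap, w1, w2):
    """Squeeze-and-Excitation: reweight each channel by a gate in (0, 1).

    fmap: C channels, each an H x W list of lists.
    w1:   (C // r) x C bottleneck weights (hypothetical values).
    w2:   C x (C // r) expansion weights (hypothetical values).
    """
    # Squeeze: global average pool each channel to a single number.
    z = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in fmap]
    # Excite: bottleneck + ReLU, then expansion + sigmoid.
    hidden = [max(0.0, sum(w * v for w, v in zip(row, z))) for row in w1]
    gates = [sigmoid(sum(w * h for w, h in zip(row, hidden))) for row in w2]
    # Scale: multiply every pixel of channel c by its gate.
    scaled = [[[g * v for v in row] for row in ch] for g, ch in zip(gates, fmap)]
    return scaled, gates

def spatial_attention(fmap, w_mean=1.0, w_max=1.0, bias=0.0):
    """Spatial gate built from channel-wise mean and max maps.

    Simplification (assumption): a 1x1 mix of the two maps stands in for
    the larger convolution a real spatial-attention layer would learn.
    """
    C, H, W = len(fmap), len(fmap[0]), len(fmap[0][0])
    mask = [[0.0] * W for _ in range(H)]
    for i in range(H):
        for j in range(W):
            column = [fmap[c][i][j] for c in range(C)]
            mask[i][j] = sigmoid(w_mean * sum(column) / C + w_max * max(column) + bias)
    out = [[[fmap[c][i][j] * mask[i][j] for j in range(W)]
            for i in range(H)] for c in range(C)]
    return out, mask

# Toy 2-channel, 2x2 feature map with a strong response in the top-left corner.
features = [
    [[4.0, 0.0], [0.0, 0.0]],
    [[2.0, 0.0], [0.0, 1.0]],
]
w1 = [[0.5, 0.5]]        # bottleneck: 2 channels -> 1 hidden unit
w2 = [[1.0], [-1.0]]     # expansion: 1 hidden unit -> 2 channel gates

channel_scaled, gates = se_channel_attention(features, w1, w2)
refined, mask = spatial_attention(channel_scaled)
```

With these toy weights the first channel is boosted relative to the second, and the spatial mask is largest at the top-left position where both channels respond.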

If this is right

The reduced parameter count allows direct deployment of sign-language recognition on mobile phones without cloud support.
Attention layers remove the need for separate hand-detection or hand-crafted feature steps such as SURF.
Deep features from the EfficientNet-B0 backbone can be fed to lightweight classical classifiers such as SVM and logistic regression while retaining over 99 percent accuracy.
The same architecture pattern offers a template for other sign-language or gesture-recognition tasks that must run on edge devices.
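The "deep features into a classical classifier" idea can be sketched with a small k-nearest-neighbour classifier. This is a toy illustration only: the 3-dimensional vectors stand in for the 1,280-dimensional pooled features, their values are made up, and `knn_predict` is a hypothetical helper rather than anything from the paper.

```python
from collections import Counter
import math

def knn_predict(train_features, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training vectors."""
    distances = sorted(
        (math.dist(vec, query), label)
        for vec, label in zip(train_features, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

# Toy 3-d vectors stand in for the 1,280-d pooled features (values are made up).
train_features = [
    [0.9, 0.1, 0.0], [1.0, 0.2, 0.1], [0.8, 0.0, 0.1],   # class "A"
    [0.1, 0.9, 0.8], [0.0, 1.0, 0.9], [0.2, 0.8, 1.0],   # class "B"
]
train_labels = ["A", "A", "A", "B", "B", "B"]

prediction = knn_predict(train_features, train_labels, [0.85, 0.15, 0.05])
print(prediction)  # prints "A"
```

The classifier itself has no trainable weights; all of the discriminative work is done by whatever network produced the feature vectors.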

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Extending the model with recurrent or transformer layers on video frames would test whether the same efficiency gains hold for continuous signing rather than isolated letters.
Training on a more diverse collection that includes multiple signers and lighting conditions could verify whether the reported accuracy remains stable outside the current dataset.
The attention mechanism's focus on hand regions suggests the architecture could be adapted to other fine-grained gesture datasets with minimal additional tuning.

Load-bearing premise

The collection of 12,637 static alphabet images and the 5-fold cross-validation procedure adequately represent the full range of real-world signer variation, lighting changes, and continuous signing sequences.

What would settle it

Accuracy falling below 95 percent on a held-out set of images captured from previously unseen signers or under changed lighting conditions would show the model does not generalize to practical use.
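One way to build such a held-out test is to split by signer rather than by image. The sketch below assumes a per-image signer identifier is available; the function name, the identifiers, and the counts are hypothetical, and the dataset discussed here is described as having no signer metadata.

```python
from collections import defaultdict

def signer_disjoint_folds(signer_ids, n_folds=5):
    """Assign sample indices to folds so that no signer spans two folds.

    signer_ids: one (hypothetical) signer identifier per image.
    Returns a list of n_folds lists of sample indices.
    """
    by_signer = defaultdict(list)
    for index, signer in enumerate(signer_ids):
        by_signer[signer].append(index)

    folds = [[] for _ in range(n_folds)]
    # Greedy balancing: place the largest signer groups first,
    # each into whichever fold is currently smallest.
    for signer in sorted(by_signer, key=lambda s: -len(by_signer[s])):
        smallest = min(folds, key=len)
        smallest.extend(by_signer[signer])
    return folds

# Made-up example: 10 signers, one of whom contributes extra images.
signer_ids = [f"signer_{i % 10}" for i in range(100)] + ["signer_0"] * 20
folds = signer_disjoint_folds(signer_ids, n_folds=5)

# Check that each test fold shares no signer with its training folds.
for test_indices in folds:
    test_set = set(test_indices)
    test_signers = {signer_ids[i] for i in test_set}
    train_signers = {signer_ids[i] for i in range(len(signer_ids)) if i not in test_set}
    assert test_signers.isdisjoint(train_signers)
```

Evaluating on folds built this way measures accuracy on signers the model has never seen, which is the quantity the test above asks about.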

Figures

Figures reproduced from arXiv: 2604.08694 by Rishabh Gupta, Shravya R. Nalla.

**Figure 1.** The overall EfficientSign architecture. The pipeline has four stages: (1) input preprocessing with data augmentation, (2) feature extraction using the EfficientNet-B0 backbone, (3) dual attention refinement through SE and Spatial Attention blocks, and (4) classification using global average pooling and a fully connected layer.

B. Data Preprocessing: All images are resized to 224×224 pixels to meet the…

**Figure 2.** 5-Fold Cross-Validation Accuracy Comparison.

C. Comparison with Prior Work

| Method | Features | Size | Acc (%) |
|---|---|---|---|
| SVM+SURF+BoF [9] | Handcrafted | 4,972 | 92.00 |
| CNN [9] | Learned | 4,972 | 78.00 |
| KNN+SURF+BoF [9] | Handcrafted | 4,972 | 65.00 |
| Wadhawan [10] | Deep (CNN) | 35,000 | ~99.0 |
| KNN+Deep (Ours) | Deep 1280d | 12,637 | 96.33 |
| SVM+Deep (Ours) | Deep 1280d | 12,637 | 99.63 |
| MobileNetV2 | Deep | 12,637 | 99.93 |
| EfficientSign | Deep+Attn | 12,637 | 99.94 |
| ResNet18 | Deep … | | |

Original abstract

How do you build a sign language recognizer that works on a phone? That question drove this work. We built EfficientSign, a lightweight model which takes EfficientNet-B0 and focuses on two attention modules (Squeeze-and-Excitation for channel focus, and a spatial attention layer that focuses on the hand gestures). We tested it against five other approaches on 12,637 images of Indian Sign Language alphabets, all 26 classes, using 5-fold cross-validation. EfficientSign achieves the accuracy of 99.94% (+/-0.05%), which matches the performance of ResNet18's 99.97% accuracy, but with 62% fewer parameters (4.2M vs 11.2M). We also experimented with feeding deep features (1,280-dimensional vectors pulled from EfficientNet-B0's pooling layer) into classical classifiers. SVM achieved the accuracy of 99.63%, Logistic Regression achieved the accuracy of 99.03% and KNN achieved accuracy of 96.33%. All of these blow past the 92% that SURF-based methods managed on a similar dataset back in 2015. Our results show that attention-enhanced learning model provides an efficient and deployable solution for ISL recognition without requiring a massive model or hand-tuned feature pipelines anymore.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

EfficientSign hits 99.94% on static ISL images with a lighter model than ResNet18, but the 5-fold CV leaves open whether the splits are signer-disjoint.

The main thing to know is that this paper takes EfficientNet-B0, adds standard Squeeze-and-Excitation and spatial attention, and reports 99.94% accuracy on 12,637 static Indian Sign Language alphabet images while using 4.2M parameters instead of ResNet18's 11.2M. They also show that an SVM on 1,280-dimensional features from EfficientNet-B0's pooling layer reaches 99.63%, which is close enough that the attention layers are not carrying the whole result. The numbers come with standard deviation from 5-fold cross-validation and beat the 2015 SURF baseline by a wide margin. That is the concrete contribution: a smaller model that works on this particular dataset for a practical accessibility use case in India.

Referee Report

3 major / 1 minor

Summary. The paper introduces EfficientSign, a lightweight CNN derived from EfficientNet-B0 by adding Squeeze-and-Excitation channel attention and a spatial attention module. On a dataset of 12,637 static Indian Sign Language alphabet images (26 classes), it reports 99.94% (±0.05%) accuracy via 5-fold cross-validation, matching ResNet18 (99.97%) while using 62% fewer parameters (4.2M vs 11.2M). Baselines using classical classifiers (SVM, LR, KNN) on 1280-dimensional EfficientNet-B0 features are also presented, all exceeding 96%.

Significance. If the evaluation protocol is shown to be signer-independent, the result would demonstrate that modest architectural additions to a compact backbone can deliver near-ResNet accuracy on ISL alphabets at substantially lower parameter count, supporting mobile deployment. The provision of multiple baselines and standard-deviation reporting from cross-validation are positive elements that allow direct efficiency comparisons.

major comments (3)

[Dataset and Experimental Setup] The dataset description and experimental protocol (implicit in the abstract and results) provide no information on the number of distinct signers, whether images are grouped by signer, or whether the 5-fold splits are signer-disjoint. Given that high accuracies are also obtained by SVM (99.63%) on raw deep features, the reported 99.94% figure risks being inflated by intra-signer leakage rather than measuring generalization across signers, which directly undermines the claim of a deployable real-world solution.
[Training and Implementation Details] No training details are supplied: optimizer, learning-rate schedule, batch size, data augmentation, regularization, or hyperparameter search protocol are absent. Without these, it is impossible to determine whether the near-perfect accuracy stems from the attention modules or from extensive tuning on a relatively homogeneous static-image collection.
[Model Architecture and Results] The manuscript presents EfficientSign as an attention-enhanced architecture but contains no ablation that isolates the contribution of the SE and spatial attention blocks relative to the unmodified EfficientNet-B0 backbone. This omission makes it difficult to attribute the parameter-efficiency claim specifically to the proposed modifications rather than to the choice of base model.

minor comments (1)

[Abstract] The abstract states 'all of these blow past the 92% that SURF-based methods managed on a similar dataset back in 2015' without citing the 2015 reference; a specific citation should be added for traceability.

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for the constructive feedback, which highlights important aspects of experimental rigor and architectural validation. We address each major comment below. Where details were omitted from the original manuscript, we will incorporate them in the revision. We note one limitation we cannot fully resolve due to dataset constraints.

Referee: [Dataset and Experimental Setup] The dataset description and experimental protocol provide no information on the number of distinct signers, whether images are grouped by signer, or whether the 5-fold splits are signer-disjoint. Given high accuracies from SVM on deep features, the 99.94% figure risks intra-signer leakage rather than true generalization across signers.

Authors: We agree that signer information and split independence are essential for validating real-world generalization. The manuscript's dataset section will be expanded to state that the 12,637-image collection does not include signer IDs or metadata from its source, preventing confirmation of signer-disjoint folds. We will explicitly note this as a limitation, reframing results as strong within-dataset performance rather than guaranteed cross-signer robustness, and qualify the deployability claim accordingly. The strong SVM baseline on features indicates discriminative representations, but we acknowledge the leakage concern. revision: partial
Referee: [Training and Implementation Details] No training details are supplied: optimizer, learning-rate schedule, batch size, data augmentation, regularization, or hyperparameter search protocol are absent.

Authors: We apologize for this omission, which hinders reproducibility and attribution of gains to the attention modules. The revised manuscript will add a dedicated 'Training Protocol' subsection detailing the optimizer, learning-rate schedule, batch size, data augmentation pipeline, regularization, and hyperparameter selection method used in our experiments. This will allow readers to assess whether the reported accuracy stems primarily from the proposed architecture. revision: yes
Referee: [Model Architecture and Results] The manuscript presents EfficientSign as an attention-enhanced architecture but contains no ablation that isolates the contribution of the SE and spatial attention blocks relative to the unmodified EfficientNet-B0 backbone.

Authors: We concur that an ablation study is required to attribute performance and efficiency gains specifically to the added modules. The revised results section will include a new table and analysis comparing unmodified EfficientNet-B0, variants with only SE blocks, only spatial attention, and the full EfficientSign model, reporting both accuracy and parameter counts. This will directly support the claim that the attention enhancements contribute to the observed efficiency. revision: yes
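For the parameter-count column of such an ablation table, the cost of each attention module can be estimated in closed form. The sketch below is a rough illustration only: the reduction ratio of 16, the 7x7 spatial kernel, and the placement on 1,280 channels are assumptions that do not come from the paper, and both helper functions are hypothetical.

```python
def se_block_params(channels, reduction=16):
    """Parameters of a Squeeze-and-Excitation block with two biased dense layers."""
    hidden = channels // reduction
    squeeze_layer = channels * hidden + hidden      # C -> C/r, plus biases
    excite_layer = hidden * channels + channels     # C/r -> C, plus biases
    return squeeze_layer + excite_layer

def spatial_attention_params(kernel_size=7):
    """Parameters of one conv over stacked [mean, max] maps: 2 in, 1 out."""
    return 2 * kernel_size * kernel_size + 1        # weights plus one bias

# Assumed settings, not taken from the paper: 1,280 channels (the feature
# width quoted in the abstract), reduction ratio 16, 7x7 spatial kernel.
se_params = se_block_params(1280, reduction=16)
sa_params = spatial_attention_params(kernel_size=7)
print(se_params, sa_params)  # prints "206160 99"
```

Under these assumptions the two modules together add roughly 0.2M parameters, so most of a 4.2M-parameter budget would still sit in the backbone, which is what the requested ablation would need to confirm or refute.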

standing simulated objections not resolved

Absence of signer metadata in the dataset source prevents providing signer-disjoint splits or fully addressing the generalization concern.

Circularity Check

0 steps flagged

No circularity: empirical accuracies from held-out 5-fold CV on static images

full rationale

The paper presents a standard attention-enhanced CNN architecture (EfficientNet-B0 backbone plus SE and spatial attention modules) and evaluates it via direct measurement of classification accuracy on 5-fold cross-validation splits of the 12,637-image dataset. No equations, parameter fits, or derivations are described that would reduce the reported 99.94% accuracy (or the parameter-count comparison to ResNet18) to quantities defined by the model's own outputs or by self-citations. Baseline results (SVM 99.63% on extracted features, etc.) are likewise independent empirical measurements. The central claims rest on held-out test performance rather than any self-referential construction, satisfying the self-contained criterion.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard supervised learning assumptions plus the empirical performance of a known backbone plus attention blocks; no new entities are postulated and no parameters are introduced beyond those already present in EfficientNet-B0 and the attention layers.

axioms (2)

domain assumption: The 12,637 images are drawn from a distribution representative of real Indian Sign Language usage.
Invoked when claiming deployability from the reported accuracy.
standard math: Standard cross-entropy training and 5-fold splits produce unbiased accuracy estimates.
Implicit in the 5-fold cross-validation protocol.

pith-pipeline@v0.9.0 · 5538 in / 1412 out tokens · 34420 ms · 2026-05-10T18:12:53.037368+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "EfficientSign is built on top of EfficientNet-B0... dual attention refinement through SE and Spatial Attention blocks... 4.2M parameters"

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · LogicNat.induction · unclear
Paper passage: "5-fold stratified cross-validation on 12,637 ISL alphabet images... 99.94% mean accuracy"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

1 extracted references · 1 canonical work pages

[1] Distinctive image features from scale-invariant keypoints,

work page 2015
