Development of ML model for triboelectric nanogenerator based sign language detection system

Bikash Baro; Meshv Patel; Mohendra Roy; Sayan Bayan

arXiv: 2604.06220 · v1 · submitted 2026-03-26 · 📡 eess.SP · cs.AI · cs.SD

Pith reviewed 2026-05-15 00:57 UTC · model grok-4.3

classification 📡 eess.SP · cs.AI · cs.SD

keywords triboelectric nanogenerator · sign language recognition · MFCC features · CNN-LSTM · wearable sensors · gesture recognition · time series classification · assistive technology

The pith

An MFCC CNN-LSTM model trained on TENG sensor glove data achieves 93.33% accuracy in recognizing 11 sign language classes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper benchmarks machine learning and deep learning models for sign language recognition using data from a triboelectric nanogenerator glove equipped with five flex sensors. It shows that processing frequency-domain MFCC features from each sensor through separate convolutional branches, then fusing them in an LSTM, yields 93.33% accuracy and 95.56% precision. This represents a substantial gain over traditional algorithms like random forest at 70.38%. The work highlights the value of frequency features for handling variations in signing speed and the importance of data augmentation for model generalization in wearable gesture systems.

Core claim

The proposed multi-sensor MFCC CNN-LSTM architecture processes frequency-domain features from each sensor through independent convolutional branches before fusion and achieves 93.33% accuracy and 95.56% precision on 11 sign classes, a 23-point improvement over the best ML algorithm.
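
The two headline metrics can diverge, which is why a precision figure above the accuracy figure is not contradictory. The sketch below is illustrative only: it assumes macro-averaged precision (the source does not say which averaging the authors used), and the labels are toy values, not data from the paper. The last line reproduces the "23-point" gap from the reported figures.

```python
import numpy as np

def accuracy_and_macro_precision(y_true, y_pred, n_classes):
    """Return (accuracy, macro-averaged precision) from integer labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    # Confusion matrix: rows are true classes, columns are predicted classes.
    cm = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(cm, (y_true, y_pred), 1)
    accuracy = np.trace(cm) / cm.sum()
    predicted_per_class = cm.sum(axis=0)
    # Per-class precision = correct predictions / all predictions of that class.
    # Classes that are never predicted contribute 0 (one common convention).
    per_class = np.divide(np.diag(cm), predicted_per_class,
                          out=np.zeros(n_classes), where=predicted_per_class > 0)
    return accuracy, per_class.mean()

# Toy labels (hypothetical, 3 classes) where macro precision exceeds accuracy.
y_true = [0, 0, 0, 0, 1, 2]
y_pred = [0, 0, 1, 1, 1, 2]
acc, prec = accuracy_and_macro_precision(y_true, y_pred, n_classes=3)
print(f"accuracy = {acc:.4f}, macro precision = {prec:.4f}")

# Headline gap from the reported figures: 93.33 - 70.38 = 22.95 points.
gap_points = 93.33 - 70.38
```

In the toy example, four of six predictions are correct, while two of the three classes are predicted with perfect precision, so the macro average sits above the accuracy.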

What carries the argument

The multi-sensor MFCC CNN-LSTM architecture that extracts Mel-frequency cepstral coefficients from each of the five TENG sensor channels, processes them independently with CNN branches, and fuses the outputs for classification.
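
To make the data flow concrete, here is a minimal NumPy forward pass through one possible reading of that layout: one convolutional branch per sensor, concatenation of the branch outputs, an LSTM over the fused sequence, and a softmax over 11 classes. Every size (frames, coefficients, filters, kernel width, hidden units) and the single-layer structure are assumptions for illustration; the weights are random and untrained, and none of this is the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
N_SENSORS, N_FRAMES, N_MFCC = 5, 12, 13            # hypothetical input sizes
N_FILTERS, KERNEL, HIDDEN, N_CLASSES = 8, 3, 16, 11

def conv1d_relu(x, w, b):
    """Valid 1D convolution over time (cross-correlation, as in DL frameworks).
    x: (T, C), w: (K, C, F), b: (F,) -> (T - K + 1, F)."""
    k = w.shape[0]
    # Stack the K-frame neighbourhoods: shape (T - K + 1, K, C).
    patches = np.stack([x[i:i + k] for i in range(x.shape[0] - k + 1)])
    return np.maximum(np.einsum("tkc,kcf->tf", patches, w) + b, 0.0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_last_hidden(seq, w_x, w_h, b):
    """Run a single-layer LSTM over seq (T, D) and return the last hidden state."""
    hidden = w_h.shape[0]
    h, c = np.zeros(hidden), np.zeros(hidden)
    for x_t in seq:
        gates = x_t @ w_x + h @ w_h + b              # shape (4 * hidden,)
        i, f, g, o = np.split(gates, 4)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
    return h

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Randomly initialised (untrained) parameters: one kernel set per sensor branch.
branch_w = [rng.normal(0, 0.1, (KERNEL, N_MFCC, N_FILTERS)) for _ in range(N_SENSORS)]
branch_b = [np.zeros(N_FILTERS) for _ in range(N_SENSORS)]
w_x = rng.normal(0, 0.1, (N_SENSORS * N_FILTERS, 4 * HIDDEN))
w_h = rng.normal(0, 0.1, (HIDDEN, 4 * HIDDEN))
b_lstm = np.zeros(4 * HIDDEN)
w_out = rng.normal(0, 0.1, (HIDDEN, N_CLASSES))

def forward(mfcc):
    """mfcc: (N_SENSORS, N_FRAMES, N_MFCC) -> class probabilities (N_CLASSES,)."""
    branches = [conv1d_relu(mfcc[s], branch_w[s], branch_b[s]) for s in range(N_SENSORS)]
    fused = np.concatenate(branches, axis=1)         # (T', N_SENSORS * N_FILTERS)
    return softmax(lstm_last_hidden(fused, w_x, w_h, b_lstm) @ w_out)

probs = forward(rng.normal(size=(N_SENSORS, N_FRAMES, N_MFCC)))
print(probs.shape, probs.sum())
```

Keeping the branches separate until the concatenation means each sensor gets its own filters, which is the property the summary above attributes to the design.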

If this is right

MFCC feature extraction creates execution-speed-invariant representations from the time-series sensor data.
Using 50-timestep windows balances temporal context and data volume better than longer windows for this task.
Data augmentation techniques such as time warping and noise injection improve the model's ability to generalize.
The parallel multi-sensor processing outperforms both classical ML and standard time-domain deep learning for this wearable application.
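
The window-size point above comes down to simple counting: shorter windows cut the same recording into more training examples. A minimal sketch, assuming non-overlapping windows and a hypothetical 400-timestep recording (the source states neither the stride nor the recording length):

```python
import numpy as np

def sliding_windows(recording, window, stride):
    """Cut a (n_timesteps, n_sensors) recording into (n_windows, window, n_sensors)."""
    starts = range(0, recording.shape[0] - window + 1, stride)
    return np.stack([recording[s:s + window] for s in starts])

# Hypothetical recording: 400 timesteps from the five sensor channels.
recording = np.random.default_rng(0).normal(size=(400, 5))

# Non-overlapping windows (stride == window) are an assumption,
# not the paper's stated protocol.
w50 = sliding_windows(recording, window=50, stride=50)
w100 = sliding_windows(recording, window=100, stride=100)
print(w50.shape, w100.shape)
```

Under these assumptions the 50-timestep setting yields twice as many examples as the 100-timestep setting, at the cost of less temporal context per example.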

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar architectures could apply to other multi-sensor wearable systems for real-time gesture or activity detection without cameras.
Expanding the dataset to include more sign classes and diverse users would test scalability for practical assistive devices.
The frequency-domain approach might reduce sensitivity to individual signing styles, enabling broader adoption in communication aids.

Load-bearing premise

The dataset collected from the custom TENG glove adequately represents variations in signing speed, style, and different users for the 11 classes.

What would settle it

Test the trained model on sign language data collected from new users performing the signs at different speeds. If accuracy falls significantly below 93%, the generalization claim is falsified.
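
One way to run such a test is a leave-one-subject-out split, in which every window from a given signer is held out together. The sketch below shows only the index bookkeeping. The subject labels are hypothetical, since the source reports no subject count, and the model training step is left as a comment.

```python
import numpy as np

def leave_one_subject_out(subject_ids):
    """Yield (held_out_subject, train_idx, test_idx) with no subject in both sets."""
    subject_ids = np.asarray(subject_ids)
    for subject in np.unique(subject_ids):
        test_mask = subject_ids == subject
        yield subject, np.flatnonzero(~test_mask), np.flatnonzero(test_mask)

# Hypothetical subject labels, one per recorded window.
subject_ids = np.array([0, 0, 1, 1, 1, 2, 2])
folds = list(leave_one_subject_out(subject_ids))
for subject, train_idx, test_idx in folds:
    # A model would be trained on train_idx and scored on test_idx here.
    print(subject, train_idx, test_idx)
```

scikit-learn's `LeaveOneGroupOut` provides the same kind of split for projects that already depend on that library.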

Figures

Figures reproduced from arXiv: 2604.06220 by Bikash Baro, Meshv Patel, Mohendra Roy, Sayan Bayan.

**Figure 1.** Schematic representation and real image of the fabri…

**Figure 3.** Generated output voltage for sign language F…

**Figure 4.** Comparative performance of all models across different…

**Figure 5.** MFCC CNN-LSTM training performance across 78…

**Figure 7.** Training accuracy and loss across epochs for different…

**Figure 8.** Performance comparison of traditional ML models…

**Figure 9.** Comprehensive LSTM training performance across 107…

Original abstract

Sign language recognition (SLR) is vital for bridging communication gaps between deaf and hearing communities. Vision-based approaches suffer from occlusion, computational costs, and physical constraints. This work presents a comparison of machine learning (ML) and deep learning models for a custom triboelectric nanogenerator (TENG)-based sensor glove. Utilizing multivariate time-series data from five flex sensors, the study benchmarks traditional ML algorithms, feedforward neural networks, LSTM-based temporal models, and a multi-sensor MFCC CNN-LSTM architecture across 11 sign classes (digits 1-5, letters A-F). The proposed MFCC CNN-LSTM architecture processes frequency-domain features from each sensor through independent convolutional branches before fusion. It achieves 93.33% accuracy and 95.56% precision, a 23-point improvement over the best ML algorithm (Random Forest: 70.38%). Ablation studies reveal 50-timestep windows offer a tradeoff between temporal context and training data volume, yielding 84.13% accuracy compared to 58.06% with 100-timestep windows. MFCC feature extraction maps temporal variations to execution-speed-invariant spectral representations, and data augmentation methods (time warping, noise injection) are essential for generalization. Results demonstrate that frequency-domain feature representations combined with parallel multi-sensor processing architectures offer enhancement over classical algorithms and time-domain deep learning for wearable sensor-based gesture recognition. This aids assistive technology development.
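
For readers unfamiliar with the feature pipeline named in the abstract, the following is a minimal NumPy sketch of textbook MFCC extraction applied to each sensor channel: framing, windowing, power spectrum, mel filterbank, log, and a DCT. The sampling rate, frame length, hop, and filter counts are all assumptions chosen so the example runs on a 50-timestep window; the abstract does not give the authors' settings. The mel scale was designed for audio, so its suitability for low-rate sensor voltages is itself an open question.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sample_rate):
    """Triangular filters evenly spaced on the mel scale, shape (n_mels, n_fft // 2 + 1)."""
    fft_freqs = np.fft.rfftfreq(n_fft, d=1.0 / sample_rate)
    edges = mel_to_hz(np.linspace(0.0, hz_to_mel(sample_rate / 2.0), n_mels + 2))
    bank = np.zeros((n_mels, fft_freqs.size))
    for m in range(n_mels):
        lo, mid, hi = edges[m], edges[m + 1], edges[m + 2]
        rising = (fft_freqs - lo) / (mid - lo)
        falling = (hi - fft_freqs) / (hi - mid)
        bank[m] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank

def mfcc(signal, sample_rate, frame_len=16, hop=8, n_mels=8, n_mfcc=6):
    """MFCCs for one sensor channel: returns an array of shape (n_frames, n_mfcc)."""
    starts = range(0, len(signal) - frame_len + 1, hop)
    frames = np.stack([signal[s:s + frame_len] for s in starts]) * np.hanning(frame_len)
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2 / frame_len
    # Small constant avoids log(0) when a filter captures no energy.
    log_mel = np.log(power @ mel_filterbank(n_mels, frame_len, sample_rate).T + 1e-10)
    # DCT-II over the mel axis decorrelates the log energies (unnormalised form).
    k = np.arange(n_mfcc)[:, None]
    n = np.arange(n_mels)[None, :]
    dct = np.cos(np.pi * k * (2 * n + 1) / (2 * n_mels))
    return log_mel @ dct.T

# Hypothetical 50-timestep window from five channels at an assumed 100 Hz sampling rate.
window = np.random.default_rng(0).normal(size=(50, 5))
features = np.stack([mfcc(window[:, ch], sample_rate=100.0) for ch in range(5)])
print(features.shape)  # (sensors, frames, coefficients)
```

Each sensor yields its own small time-by-coefficient matrix, which is the kind of per-channel input a separate convolutional branch could consume.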

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

TENG glove with MFCC-CNN-LSTM hits 93% on 11 signs but the accuracy likely does not survive user-independent splits.

The paper's main result is a five-sensor TENG glove that feeds MFCC features from each channel into separate convolutional branches, fuses them, and runs an LSTM to classify 11 signs at 93.33% accuracy and 95.56% precision. That beats the random-forest baseline (70.38%) by 23 points, and the ablation shows 50-timestep windows clearly outperforming 100-timestep windows (84.13% vs. 58.06%). The sensor hardware and the exact multi-branch layout on TENG data are new enough to count as a practical extension of existing gesture work. The authors also report that time-warping and noise augmentation help, which is useful to see spelled out. What the paper does cleanly is lay out the model comparisons and the window-size tradeoff with concrete numbers.

The soft spot is exactly what the stress-test note flags: the abstract and available details give no subject count, no trial count per sign, and no statement that the train-test split is user-independent. If the data come from a handful of collectors and the split mixes their trials, the 93% figure is probably picking up person-specific patterns in signing speed and pressure rather than learning transferable representations. MFCC is supposed to remove speed dependence, but without cross-user validation or error bars that claim stays untested. The swing from 84.13% to 58.06% when the window grows from 50 to 100 timesteps already hints at sensitivity to data volume.

This work is for people building wearable assistive devices who want a concrete sensor-plus-model example they can try to reproduce. It is not a broad methodological advance. I would send it to peer review because the hardware is real, the numbers are reported, and the architecture choice is worth checking, but any referee will need to press on the dataset protocol and generalization tests before the performance claim can be trusted.

Referee Report

3 major / 1 minor

Summary. The paper develops ML and DL models for sign language recognition using a custom TENG-based sensor glove with five flex sensors. It benchmarks traditional algorithms (e.g., Random Forest at 70.38%), feedforward nets, LSTMs, and proposes a multi-sensor MFCC CNN-LSTM that processes frequency-domain features per sensor before fusion. On 11 classes (digits 1-5, letters A-F), the proposed model reports 93.33% accuracy and 95.56% precision. Ablations compare 50-timestep (84.13%) vs. 100-timestep (58.06%) windows and emphasize MFCC for speed-invariance plus data augmentation for generalization.

Significance. If the performance claims hold under proper validation, the work is significant for wearable, non-vision SLR systems. It demonstrates concrete gains from frequency-domain features and parallel multi-sensor CNN branches over time-domain baselines, with ablations that highlight tradeoffs in temporal context vs. data volume. This could guide TENG sensor integration in assistive devices, provided generalization is established.

major comments (3)

[Experimental Setup / Results] Dataset and validation protocol details are absent: no subject count, trials per sign, or cross-validation method (e.g., leave-one-subject-out) is reported. This directly undermines the central 93.33% accuracy claim, as the 50-timestep ablation already shows high sensitivity to window length and data volume, suggesting possible overfitting to collection-specific patterns rather than user-invariant representations.
[Results] Results lack error bars, standard deviations, or statistical significance tests for the reported accuracies (93.33%, 84.13%, 70.38%) and the 23-point improvement. Without these, the ablation findings and model superiority cannot be assessed reliably.
[Feature Extraction / Discussion] The claim that MFCC yields execution-speed-invariant features is stated but unsupported by explicit tests (e.g., controlled speed variations or comparisons to raw time-series under speed changes). This is load-bearing for the architecture's advantage over time-domain models.
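
The statistical reporting requested in the second comment is cheap to produce once per-run scores exist. The sketch below computes a mean with sample standard deviation and a paired t statistic for two models evaluated on the same folds or seeds. The accuracy lists are placeholders invented for illustration and are not results from the paper.

```python
import numpy as np

def mean_and_std(scores):
    """Mean and sample standard deviation (ddof=1) across repeated runs."""
    scores = np.asarray(scores, dtype=float)
    return scores.mean(), scores.std(ddof=1)

def paired_t_statistic(scores_a, scores_b):
    """Paired t statistic for two models scored on the same folds or seeds."""
    diff = np.asarray(scores_a, dtype=float) - np.asarray(scores_b, dtype=float)
    return diff.mean() / (diff.std(ddof=1) / np.sqrt(diff.size))

# Placeholder per-run accuracies, NOT results from the paper.
model_a = [0.90, 0.92, 0.94, 0.91, 0.93]
model_b = [0.70, 0.73, 0.71, 0.69, 0.72]
mean_a, std_a = mean_and_std(model_a)
t_stat = paired_t_statistic(model_a, model_b)
print(f"{mean_a:.3f} +/- {std_a:.3f}, t = {t_stat:.2f} (df = {len(model_a) - 1})")
```

Turning the statistic into a p-value requires the t distribution with n - 1 degrees of freedom; `scipy.stats.ttest_rel` performs both steps if SciPy is available.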

minor comments (1)

[Abstract / Methods] The abstract and text mention data augmentation (time warping, noise injection) but provide no quantitative ablation on their contribution to the final accuracy.
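
For reference, the two augmentations named in this comment can be sketched in a few lines of NumPy. This is a generic version, assuming Gaussian noise and a smooth random distortion of the time axis. The noise level, warp strength, and knot count are hypothetical parameters, and the authors' exact procedure is not described in the abstract.

```python
import numpy as np

def add_noise(window, sigma, rng):
    """Noise injection: add zero-mean Gaussian noise to every sample."""
    return window + rng.normal(0.0, sigma, size=window.shape)

def time_warp(window, strength, rng, n_knots=4):
    """Time warping: resample along a smooth, monotonic, randomly distorted time axis.
    Requires 0 <= strength < 1 so that all local speeds stay positive."""
    n = window.shape[0]
    # Random positive local speeds at a few knots, interpolated to every timestep.
    knot_speeds = rng.uniform(1.0 - strength, 1.0 + strength, size=n_knots)
    speeds = np.interp(np.linspace(0, n_knots - 1, n), np.arange(n_knots), knot_speeds)
    warped_t = np.cumsum(speeds)
    # Rescale so the warped axis still spans the original index range [0, n - 1].
    warped_t = (warped_t - warped_t[0]) / (warped_t[-1] - warped_t[0]) * (n - 1)
    original_t = np.arange(n)
    return np.stack([np.interp(warped_t, original_t, window[:, ch])
                     for ch in range(window.shape[1])], axis=1)

rng = np.random.default_rng(0)
# Toy 5-channel window: the same sine wave copied to every channel.
window = np.sin(np.linspace(0, 4 * np.pi, 50))[:, None] * np.ones((1, 5))
warped = time_warp(window, strength=0.3, rng=rng)
augmented = add_noise(warped, sigma=0.02, rng=rng)
print(augmented.shape)
```

An ablation of the kind the comment asks for would train once with each augmentation switched off and report the resulting accuracy alongside the full configuration.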

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We agree that several aspects of the experimental reporting require clarification and strengthening. We will revise the manuscript to address each point as detailed below.

Referee: [Experimental Setup / Results] Dataset and validation protocol details are absent: no subject count, trials per sign, or cross-validation method (e.g., leave-one-subject-out) is reported. This directly undermines the central 93.33% accuracy claim, as the 50-timestep ablation already shows high sensitivity to window length and data volume, suggesting possible overfitting to collection-specific patterns rather than user-invariant representations.

Authors: We agree that these details were omitted from the original submission and that they are essential for assessing generalization. In the revised manuscript we will report the number of subjects, trials per sign class, and the exact cross-validation procedure employed. We will also expand the discussion of the ablation results to address the risk of overfitting and explain how data augmentation was used to promote user-invariant representations.

revision: yes

Referee: [Results] Results lack error bars, standard deviations, or statistical significance tests for the reported accuracies (93.33%, 84.13%, 70.38%) and the 23-point improvement. Without these, the ablation findings and model superiority cannot be assessed reliably.

Authors: We acknowledge the need for statistical rigor. The revised results section will include standard deviations across repeated training runs, error bars on all reported accuracies, and appropriate statistical tests (e.g., paired t-tests) to support the claimed improvements and ablation comparisons.

revision: yes

Referee: [Feature Extraction / Discussion] The claim that MFCC yields execution-speed-invariant features is stated but unsupported by explicit tests (e.g., controlled speed variations or comparisons to raw time-series under speed changes). This is load-bearing for the architecture's advantage over time-domain models.

Authors: While MFCC features are widely recognized for providing robustness to temporal scaling in signal processing, we agree that an explicit demonstration on our dataset is required. We will add controlled experiments that vary sign execution speed and directly compare MFCC-based inputs against raw time-series inputs under those conditions, with results reported in the revised manuscript.

revision: yes
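
A controlled speed experiment of this kind can be prototyped on synthetic data before any new recordings are made. The harness below resamples a signal to simulate faster or slower execution and measures how much a feature vector moves. The two feature functions are toy stand-ins chosen for illustration (a raw-sample prefix and a crude spectral summary), not the paper's inputs, and the harness makes no claim about which representation turns out to be more stable.

```python
import numpy as np

def change_speed(signal, speed):
    """Resample a 1D signal as if the gesture were performed `speed` times faster."""
    n_out = max(2, int(round(len(signal) / speed)))
    positions = np.linspace(0, len(signal) - 1, n_out)
    return np.interp(positions, np.arange(len(signal)), signal)

def relative_feature_shift(feature_fn, signal, speed):
    """Relative L2 change in a feature vector when the signal is replayed at another speed."""
    base = np.asarray(feature_fn(signal), dtype=float)
    shifted = np.asarray(feature_fn(change_speed(signal, speed)), dtype=float)
    return np.linalg.norm(shifted - base) / (np.linalg.norm(base) + 1e-12)

# Toy stand-ins for the two kinds of input being compared (both hypothetical).
def raw_prefix(x):
    """Time-domain input: the first 20 raw samples."""
    return x[:20]

def spectrum_summary(x):
    """Frequency-domain input: normalised magnitudes of the 8 lowest FFT bins."""
    mag = np.abs(np.fft.rfft(x))[:8]
    return mag / (mag.sum() + 1e-12)

signal = np.sin(np.linspace(0, 2 * np.pi, 100)) ** 2   # synthetic gesture-like pulse
for speed in (0.8, 1.0, 1.25):
    print(speed,
          relative_feature_shift(raw_prefix, signal, speed),
          relative_feature_shift(spectrum_summary, signal, speed))
```

In a real study the feature functions would be replaced by the actual MFCC and time-domain pipelines, and the quantity of interest would be classification accuracy at each speed rather than a feature-space distance.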

Circularity Check

0 steps flagged

No circularity: results are direct empirical measurements on held-out test data

The paper presents an empirical ML benchmarking study on a custom-collected TENG sensor dataset for 11 sign classes. Reported metrics (93.33% accuracy for MFCC CNN-LSTM, 70.38% for Random Forest) are obtained by training models on training splits and evaluating on test splits. Ablation studies on window lengths (50 vs 100 timesteps) and feature choices (MFCC vs time-domain) are likewise direct performance measurements. No equations, derivations, or self-citations reduce any reported accuracy or precision to a fitted parameter or input by construction. The architecture description and augmentation methods are standard and do not create self-referential loops. The central claims rest on observable test-set performance rather than any definitional or fitted-input equivalence.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The central claim rests on standard machine learning assumptions for time-series classification and the domain-specific effectiveness of MFCC for speed invariance. No new physical entities are postulated; the TENG sensors draw from existing technology.

free parameters (1)

timestep window size = 50
Ablation-selected value of 50 timesteps as tradeoff between temporal context and training data volume

axioms (2)

domain assumption: MFCC feature extraction maps temporal variations to execution-speed-invariant spectral representations
Invoked to justify frequency-domain processing for handling variable signing speeds

domain assumption: The collected data from five sensors is sufficient and representative for training generalizable models across the 11 classes
Underlying all model training, ablation, and performance claims

pith-pipeline@v0.9.0 · 5569 in / 1507 out tokens · 68309 ms · 2026-05-15T00:57:07.817952+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · Jcost definition and uniqueness · unclear

Paper passage: 50-timestep windows offer a tradeoff... 84.13% accuracy compared to 58.06% with 100-timestep windows; MFCC feature extraction maps temporal variations to execution-speed-invariant spectral representations

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · LogicNat 8-tick / period forcing · unclear

Paper passage: multi-sensor MFCC CNN-LSTM architecture processes frequency-domain features from each sensor through independent convolutional branches before fusion

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

17 extracted references · 17 canonical work pages

[1] D. Bragg et al., “Sign language recognition, generation, and translation: An interdisciplinary perspective,” in Proc. 21st Int. ACM SIGACCESS Conf. Comput. Accessibility, 2019, pp. 16–31.

[2] R. Rastgoo, K. Kiani, and S. Escalera, “Sign language recognition: A deep survey,” Expert Syst. Appl., vol. 164, p. 113794, 2021.

[3] L. Dipietro, A. M. Sabatini, and P. Dario, “A survey of glove-based systems and their applications,” IEEE Trans. Syst., Man, Cybern. C, vol. 38, no. 4, pp. 461–482, 2008.

[4] H. I. Fawaz et al., “Deep learning for time series classification: A review,” Data Min. Knowl. Discov., vol. 33, no. 4, pp. 917–963, 2019.

[5] B. K. Iwana and S. Uchida, “An empirical survey of data augmentation for time series classification with neural networks,” PLOS ONE, vol. 16, no. 7, p. e0254841, 2021.

[6] O. Koller, N. C. Camgoz, H. Ney, and R. Bowden, “Weakly supervised learning with multi-stream CNN-LSTM-HMMs to discover sequential parallelism in sign language videos,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 42, no. 9, pp. 2306–2320, 2020.

[7] T. Starner, J. Weaver, and A. Pentland, “Real-time American sign language recognition using desk and wearable computer based video,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 20, no. 12, pp. 1371–1375, 1998.

[8] A. Graves, A. Mohamed, and G. Hinton, “Speech recognition with deep recurrent neural networks,” in Proc. ICASSP, 2013, pp. 6645–6649.

[9] F. J. Ordóñez and D. Roggen, “Deep convolutional and LSTM recurrent neural networks for multimodal wearable activity recognition,” Sensors, vol. 16, no. 1, p. 115, 2016.

[10] D. Ravi, C. Wong, B. Lo, and G. Z. Yang, “Deep learning for human activity recognition: A resource efficient implementation on low-power devices,” in Proc. IEEE BSN, 2016, pp. 71–76.

[11] T. T. Um et al., “Data augmentation of wearable sensor data for Parkinson’s disease monitoring using convolutional neural networks,” in Proc. 19th ACM Int. Conf. Multimodal Interact., 2017, pp. 216–220.

[12] F. Pedregosa et al., “Scikit-learn: Machine learning in Python,” J. Mach. Learn. Res., vol. 12, pp. 2825–2830, 2011.

[13] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in Proc. ICML, 2015, pp. 448–456.

[14] S. Davis and P. Mermelstein, “Comparison of parametric representations for monosyllabic word recognition,” IEEE Trans. Acoust., Speech, Signal Process., vol. 28, no. 4, pp. 357–366, 1980.

[15] T. Ganchev, N. Fakotakis, and G. Kokkinakis, “Comparative evaluation of various MFCC implementations on the speaker verification task,” in Proc. SPECOM, vol. 1, 2005, pp. 191–194.

[16] A. Ji et al., “Dataglove for sign language recognition via wearable inertial sensors,” Sensors, vol. 23, no. 15, p. 6693, 2023.

[17] R. Zhu et al., “Machine-learning-assisted soft fiber optic glove system for sign language recognition,” IEEE Robot. Autom. Lett., vol. 9, no. 2, pp. 1540–1547, 2024.
