pith. machine review for the scientific record.

arxiv: 2605.08663 · v1 · submitted 2026-05-09 · 💻 cs.CV

Recognition: 2 theorem links · Lean Theorem

CAST: Channel-Aware Spatial Transfer Learning with Pseudo-Image Radar for Sign Language Recognition

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 01:30 UTC · model grok-4.3

classification 💻 cs.CV
keywords sign language recognition · radar sensing · range-time maps · cadence velocity diagrams · transfer learning · attention mechanism · dual-stream model · gesture recognition

The pith

A dual-stream radar architecture using physics-aware processing achieves 80.5% accuracy for isolated sign language recognition.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes CAST, a dual-stream architecture for recognizing isolated sign language gestures from magnitude-only 60 GHz radar Range-Time Maps. It combines an explicit decibel-to-linear inversion and windowed FFT to create clean Cadence Velocity Diagrams, a cross-antenna spatial attention module that operates before convolution, and an asymmetric cross-attention fusion between two pretrained vision backbones. Together these components deliver 80.5% Top-1 accuracy under 5-fold cross-validation, 3.3 percentage points higher than the best single-model baseline (77.2%). This matters because it demonstrates a way to adapt vision models to radar data while respecting the underlying physics, potentially enabling more robust and privacy-friendly gesture recognition systems.
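The asymmetric cross-attention fusion can be pictured as one stream's tokens querying the other's. Below is a minimal single-head sketch in NumPy; the direction (CVD tokens attending into RTM tokens) and the token shapes are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(queries, keys_values):
    """Single-head cross-attention: one stream supplies the queries,
    the other supplies both keys and values, which is what makes the
    fusion asymmetric."""
    d = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d)   # (n_q, n_kv)
    return softmax(scores, axis=-1) @ keys_values   # (n_q, d)

# Hypothetical token shapes: 3 CVD tokens query 5 RTM tokens.
cvd_tokens = np.random.default_rng(0).normal(size=(3, 16))
rtm_tokens = np.random.default_rng(1).normal(size=(5, 16))
fused = cross_attend(cvd_tokens, rtm_tokens)
```

Each fused row is a convex combination of the RTM tokens, so the query stream selectively reads the other modality rather than symmetrically mixing with it.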

Core claim

CAST integrates three physics-aware elements: an inversion from decibel to linear scale followed by a windowed fast Fourier transform to extract Cadence Velocity Diagrams without harmonic artifacts, a cross-antenna spatial attention applied to raw antenna channels, and asymmetric cross-attention that fuses representations from a ConvNeXt-Tiny backbone on the velocity diagrams with an EfficientNetV2-S backbone on the range-time maps. This dual-stream setup yields a Top-1 accuracy of 80.5% on a dataset of clinical and alphabetical gestures under 5-fold cross-validation, outperforming the strongest single-model baseline (77.2%) by 3.3 percentage points.
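The first component can be sketched end to end. The paper leaves the window type, length, overlap, and dB convention unspecified (the referee's third minor comment asks for exactly this), so the Hann window over the full slow-time axis and the amplitude (/20) inversion below are assumptions:

```python
import numpy as np

def cvd_from_rtm(rtm_db):
    """Sketch of the CVD extraction: explicit dB-to-linear inversion,
    then a windowed FFT along slow time for each range bin.

    rtm_db: (range_bins, time_steps) Range-Time Map in decibels.
    Returns a (range_bins, time_steps) magnitude spectrum over
    cadence frequency per range bin.
    """
    # 1. Invert the log compression first: taking the FFT of a
    #    dB-scale envelope spreads energy into spurious harmonics.
    #    The /20 (amplitude-dB) convention is an assumption.
    rtm_lin = 10.0 ** (rtm_db / 20.0)

    # 2. Windowed FFT along the slow-time axis. Window choice is an
    #    assumption; the paper does not specify type/length/overlap.
    win = np.hanning(rtm_lin.shape[1])
    spectrum = np.fft.fft(rtm_lin * win[None, :], axis=1)
    return np.abs(spectrum)
```

The ordering is the point: inverting to linear scale before the spectral analysis is what avoids the harmonic artifacts the paper attributes to FFTs of log-compressed signals.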

What carries the argument

The CAST dual-stream architecture with physics-aware pseudo-image radar processing, cross-antenna spatial attention, and asymmetric cross-attention fusion between CVD and RTM streams.

Load-bearing premise

The accuracy improvement results specifically from the physics-aware inversion, cross-antenna attention, and asymmetric fusion rather than from differences in training procedures or model capacities.

What would settle it

Re-evaluate the single-model baselines using the exact same training protocol, data augmentation, and cross-validation splits as the proposed CAST model to check if the 3.3% gap remains.
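A matched re-evaluation would yield one accuracy per fold for each model, which a paired test can compare directly. A minimal sketch; any fold accuracies fed in would come from the matched re-runs, since the paper's per-fold numbers are not reported:

```python
import math

def paired_t(acc_a, acc_b):
    """Paired t statistic over per-fold accuracies from k-fold CV.

    acc_a, acc_b: same-length lists of fold accuracies for the two
    models evaluated on identical folds. Standard caveat: CV folds
    are not independent, so the nominal t distribution is optimistic
    (see Dietterich 1998; Nadeau & Bengio 2003).
    """
    diffs = [a - b for a, b in zip(acc_a, acc_b)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    return mean / math.sqrt(var / n)
```

With five matched folds this gives a single statistic to compare against a t distribution with 4 degrees of freedom, turning the 3.3-point gap into a testable claim rather than a point estimate.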

Figures

Figures reproduced from arXiv: 2605.08663 by Fakhri Karray, Hamdi Altaheri, Md. Milon Islam, Md Rabiul Islam, Md Rezwanul Haque, Md. Shakhoyat Rahman Shujon, Sheikh Md. Galib Mahim.

Figure 1. Overall architecture of the proposed CAST model. Three-receiver RTMs are processed through per-stream CASA modules…
Figure 2. The CASA module. Per-antenna features are globally…
Figure 3. Most-confused pair: 67 N→56 M (7 errors). (a) RTM of a misclassified sample (true label: 67 N). (b) CVD of the same sample. (c) RTM of a correctly classified 56 M sample. (d) CVD of the same. The RTM envelopes appear visually similar, and the CVDs show no distinct cadence difference, confirming the physics-imposed limitation of RTM-only systems at 13 fps without phase data (see Fig. S2 for all confused pairs).
read the original abstract

We propose CAST, a dual-stream architecture that utilizes channel-aware spatial transfer learning for isolated sign language recognition addressing the challenges of magnitude-only 60 GHz radar Range-Time Maps (RTM). The proposed framework combines three physics-aware architectures with pretrained vision backbones, which operate under radar-only constraints across clinical and alphabetical gestures. First, an explicit decibel-to-linear inversion is combined with a windowed fast Fourier transform that extracts Cadence Velocity Diagrams (CVD) while avoiding the harmonic artifacts that arise from the spectral analysis of log-compressed signals. Second, a cross-antenna spatial attention module applies attention to raw antenna channels before the convolution, preserving inter-receiver amplitude covariance. Third, an asymmetric cross-attention mechanism fuses representations from parallel ConvNeXt-Tiny (CVD) and EfficientNetV2-S (RTM) backbones. Extensive experiments reveal that the architecture achieves a Top-1 accuracy of 80.5% under 5-fold cross-validation, establishing a 3.3% improvement over the best single-model baseline (77.2%). The findings suggest that physics-aware signal representations form a promising direction for radar-only sign language recognition under constrained sensor modalities. The source code is available at: https://github.com/Shakhoyat/CAST-at-SignEval2026.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated authors' rebuttal, a circularity audit, and an axiom ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The manuscript proposes CAST, a dual-stream architecture for isolated sign language recognition from 60 GHz radar Range-Time Maps (RTM). It combines a dB-to-linear inversion with windowed FFT to extract Cadence Velocity Diagrams (CVD) avoiding harmonic artifacts, a cross-antenna spatial attention module applied before convolution, and asymmetric cross-attention fusion between a ConvNeXt-Tiny stream on CVD and an EfficientNetV2-S stream on RTM. Under 5-fold cross-validation the model reports 80.5% Top-1 accuracy, a 3.3% gain over the best single-model baseline (77.2%). Source code is released.

Significance. If the reported gain can be isolated to the physics-aware inversion, cross-antenna attention, and asymmetric fusion rather than unmatched training protocols or increased model capacity, the work would demonstrate a useful direction for radar-only sign language recognition. The explicit incorporation of radar signal properties into pretrained vision backbones and the public code release are strengths that would support follow-on research in constrained sensor modalities.

major comments (2)
  1. [Abstract and Results] Abstract and experimental results: the central claim of a 3.3% Top-1 accuracy improvement (80.5% vs. 77.2%) under 5-fold CV is presented without any dataset size, class count, statistical tests, error bars, or ablation tables. This prevents verification that the delta arises from the three proposed modules rather than baseline differences.
  2. [Experimental Setup] Experimental setup: no section confirms that the single-model baselines (ConvNeXt-Tiny and EfficientNetV2-S) received identical optimizer, learning-rate schedule, batch size, data augmentation, epoch count, or initialization as the full dual-stream CAST model. The dual-stream design also increases total capacity, so the reported gain cannot yet be attributed specifically to the dB-to-linear inversion, cross-antenna attention, or asymmetric fusion.
minor comments (3)
  1. The manuscript should explicitly state the dataset (number of samples, number of classes, train/test split details) and the exact gesture vocabulary used for the clinical and alphabetical signs.
  2. [Experiments] Add a component-wise ablation table (removing inversion, removing cross-antenna attention, removing asymmetric fusion) to quantify each module's contribution.
  3. [Methods] Clarify the precise implementation of the windowed FFT (window type, length, overlap) and the channel dimensions of the cross-antenna attention module.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which highlight important areas for improving the clarity and rigor of our experimental claims. We address each major comment point by point below and will revise the manuscript accordingly.

read point-by-point responses
  1. Referee: [Abstract and Results] Abstract and experimental results: the central claim of a 3.3% Top-1 accuracy improvement (80.5% vs. 77.2%) under 5-fold CV is presented without any dataset size, class count, statistical tests, error bars, or ablation tables. This prevents verification that the delta arises from the three proposed modules rather than baseline differences.

    Authors: We agree that the abstract and results presentation would benefit from these details to allow readers to assess the source of the reported improvement. In the revised manuscript, we will expand the abstract to include the dataset size and class count from the SignEval2026 benchmark. We will also add error bars to the reported accuracies, include statistical significance tests (e.g., paired t-tests or McNemar's test across the 5 folds), and provide ablation tables that systematically isolate the contributions of the dB-to-linear inversion, cross-antenna spatial attention, and asymmetric cross-attention fusion. These additions will help verify that the 3.3% gain is attributable to the proposed physics-aware components. revision: yes

  2. Referee: [Experimental Setup] Experimental setup: no section confirms that the single-model baselines (ConvNeXt-Tiny and EfficientNetV2-S) received identical optimizer, learning-rate schedule, batch size, data augmentation, epoch count, or initialization as the full dual-stream CAST model. The dual-stream design also increases total capacity, so the reported gain cannot yet be attributed specifically to the dB-to-linear inversion, cross-antenna attention, or asymmetric fusion.

    Authors: We acknowledge that the manuscript does not explicitly confirm identical training protocols for the baselines in a dedicated section. In the revision, we will add a table or subsection detailing all hyperparameters (optimizer, learning-rate schedule, batch size, data augmentation, epochs, and initialization) and state that they are shared across the single-stream baselines and the full CAST model. To address the capacity concern, we will include an additional ablation comparing the full dual-stream CAST against a dual-stream variant that uses simple feature concatenation (without the proposed attention mechanisms) while keeping total capacity matched. This will help isolate the specific contributions of the physics-aware inversion and attention modules. revision: yes
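The McNemar's test the rebuttal proposes needs only the two discordant counts on pooled held-out predictions. A sketch; the counts in any real use would come from the matched evaluation, and the decision threshold is the standard chi-square cutoff:

```python
def mcnemar_chi2(b, c):
    """McNemar's chi-square statistic with continuity correction.

    b: samples the baseline classifies correctly but CAST misses.
    c: samples CAST classifies correctly but the baseline misses.
    Under the null of equal accuracy the statistic is approximately
    chi-square with 1 degree of freedom, so values above 3.841
    reject at alpha = 0.05.
    """
    if b + c == 0:
        return 0.0
    return (abs(b - c) - 1) ** 2 / (b + c)
```

Because it conditions only on disagreements between the two models, McNemar's test is well suited to paired classifiers evaluated on the same held-out samples.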

Circularity Check

0 steps flagged

No circularity: empirical CV accuracy is measured on held-out folds, independent of architecture definitions

full rationale

The paper reports a measured Top-1 accuracy of 80.5% under 5-fold cross-validation on radar sign-language data, compared against single-model baselines using the same pretrained backbones. This is a direct empirical result on external held-out folds, not an equation or parameter that reduces to its own inputs by construction. No self-definitional loops, fitted inputs renamed as predictions, or load-bearing self-citations appear in the architecture description or results. The three physics-aware modules (dB-to-linear + windowed FFT, cross-antenna attention, asymmetric fusion) are design choices whose performance impact is tested via comparison, not presupposed. Per the hard rules, an empirical result on CV folds with no reduction to fitted parameters or self-citation chains receives score 0. Concerns about unmatched training procedures or capacity are correctness risks, not circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work rests on standard transfer-learning assumptions and signal-processing conventions without introducing new free parameters, axioms, or invented entities beyond the described network modules.

axioms (1)
  • domain assumption Pretrained vision backbones transfer effectively to radar-derived pseudo-images
    Invoked by the choice of ConvNeXt-Tiny and EfficientNetV2-S backbones operating on RTM and CVD inputs

pith-pipeline@v0.9.0 · 5574 in / 1158 out tokens · 47015 ms · 2026-05-12T01:30:30.905263+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

31 extracted references · 31 canonical work pages · 1 internal anchor

  1. [1]

    Multisource approaches to Italian sign language (LIS) recognition: Insights from the MultiMedaLIS dataset

    Gaia Caligiore, Raffaele Mineo, Concetto Spampinato, Egidio Ragonese, Simone Palazzo, and Sabina Fontana. Multisource approaches to Italian sign language (LIS) recognition: Insights from the MultiMedaLIS dataset. In Proceedings of the Tenth Italian Conference on Computational Linguistics (CLiC-it 2024), pages 132–140, Pisa, Italy, 2024. CEUR Workshop Pr...

  2. [2]

    CrossViT: Cross-attention multi-scale vision transformer for image classification

    Chun-Fu Chen, Quanfu Fan, and Rameswar Panda. CrossViT: Cross-attention multi-scale vision transformer for image classification. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV 2021), pages 357–

  3. [3]

    Approximate statistical tests for comparing supervised classification learning algorithms

    Thomas G. Dietterich. Approximate statistical tests for comparing supervised classification learning algorithms. Neural Computation, 10(7):1895–1923, 1998.

  4. [4]

    SignEval 2026 challenges results

    Ahmed Abul Hasanaath, Raffaele Mineo, Hamzah Luqman, Sarah Alyami, Maad Alowaifeer, Amelia Sorrenti, Gaia Caligiore, Sabina Fontana, Egidio Ragonese, Giovanni Bellitto, Federica Proietto Salanitri, Concetto Spampinato, Motaz Alfarraj, Mufti Mahmud, Simone Palazzo, and Nour Imane Zeghib. SignEval 2026 challenges results. In Proceedings of the IEEE/CVF C...

  5. [5]

    Squeeze-and-excitation networks

    Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 7132–7141. IEEE/CVF, 2018.

  6. [6]

    FusionEnsemble-Net: An attention-based ensemble of spatiotemporal networks for multimodal sign language recognition

    Md. Milon Islam and Md. Rezwanul Haque. FusionEnsemble-Net: An attention-based ensemble of spatiotemporal networks for multimodal sign language recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops (ICCVW 2025), MSLR Workshop, pages 4983–

  7. [7]

    IEEE/CVF, 2025. 2, 8

  8. [8]

    Averaging weights leads to wider optima and better generalization

    Pavel Izmailov, Dmitrii Podoprikhin, Timur Garipov, Dmitry Vetrov, and Andrew Gordon Wilson. Averaging weights leads to wider optima and better generalization. In Proceedings of the 34th Conference on Uncertainty in Artificial Intelligence (UAI 2018), pages 1–12. AUAI Press, 2018.

  9. [9]

    Rodar: Robust gesture recognition based on mmWave radar under human activity interference

    C. Jin, X. Meng, X. Li, J. Wang, M. Pan, et al. Rodar: Robust gesture recognition based on mmWave radar under human activity interference. IEEE Transactions on Mobile Computing, 23(12):11735–11749, 2024.

  10. [10]

    Multimodal Italian sign language recognition with radar-video late fusion

    Roman Juranek et al. Multimodal Italian sign language recognition with radar-video late fusion. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops (ICCVW 2025), MSLR Workshop, pages 5079–

  11. [11]

    IEEE/CVF, 2025. 1, 2

  12. [12]

    Human activity classification based on micro-Doppler signatures using a support vector machine

    Youngwook Kim and Hao Ling. Human activity classification based on micro-Doppler signatures using a support vector machine. IEEE Transactions on Geoscience and Remote Sensing, 47(5):1328–1337, 2009.

  13. [13]

    Soli: Ubiquitous gesture sensing with millimeter wave radar

    Jaime Lien, Nicholas Gillian, M. Emre Karagozler, Patrick Amihood, Carsten Schwesig, Erik Olson, Hakim Raja, and Ivan Poupyrev. Soli: Ubiquitous gesture sensing with millimeter wave radar. In ACM SIGGRAPH 2016 Papers, pages 1–19. ACM, 2016.

  14. [14]

    A ConvNet for the 2020s

    Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, and Saining Xie. A ConvNet for the 2020s. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 11976–11986. IEEE/CVF, 2022.

  15. [15]

    Decoupled weight decay regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations (ICLR), pages 1–18, 2019.

  16. [16]

    Sign language recognition for patient-doctor communication: A multimedia/multimodal dataset

    Raffaele Mineo, Gaia Caligiore, Concetto Spampinato, Sabina Fontana, Simone Palazzo, and Egidio Ragonese. Sign language recognition for patient-doctor communication: A multimedia/multimodal dataset. In Proceedings of the IEEE 8th Forum on Research and Technologies for Society and Industry Innovation (RTSI), pages 202–207. IEEE, 2024.

  17. [17]

    Text-aligned radar-based sign language recognition for healthcare communication

    Raffaele Mineo, Amelia Sorrenti, Gaia Caligiore, Federica Proietto Salanitri, Giovanni Bellitto, Senya Polikovsky, Sabina Fontana, Egidio Ragonese, Concetto Spampinato, and Simone Palazzo. Text-aligned radar-based sign language recognition for healthcare communication. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops (...

  18. [18]

    Radar-based imaging for sign language recognition in medical communication

    Raffaele Mineo, Amelia Sorrenti, Gaia Caligiore, Federica Proietto Salanitri, Giovanni Bellitto, Senya Polikovsky, Sabina Fontana, Egidio Ragonese, Concetto Spampinato, and Simone Palazzo. Radar-based imaging for sign language recognition in medical communication. In Proceedings of the 28th International Conference on Medical Image Computing and Computer...

  19. [19]

    A benchmark for radar-based Italian sign language recognition using frequency-domain range-time maps

    Raffaele Mineo, Amelia Sorrenti, Gaia Caligiore, Federica Proietto Salanitri, Giovanni Bellitto, Senya Polikovsky, Sabina Fontana, Egidio Ragonese, Concetto Spampinato, and Simone Palazzo. A benchmark for radar-based Italian sign language recognition using frequency-domain range-time maps. In Proceedings of the IEEE/CVF Conference on Computer Vision an...

  20. [20]

    Inference for the generalization error

    Claude Nadeau and Yoshua Bengio. Inference for the generalization error. Machine Learning, 52(3):239–281, 2003.

  21. [21]

    SpecAugment: A simple data augmentation method for automatic speech recognition

    Daniel S. Park, William Chan, Yu Zhang, Chung-Cheng Chiu, Barret Zoph, Ekin D. Cubuk, and Quoc V. Le. SpecAugment: A simple data augmentation method for automatic speech recognition. In Proceedings of Interspeech 2019, pages 2613–

  22. [22]

    Modality-specific benchmarks and radar range-doppler envelope classification for multimodal isolated sign language recognition

    Dmitriy Sazonov, Kamrul Islam, Evie Malaia, and Sevgi Gurbuz. Modality-specific benchmarks and radar range-doppler envelope classification for multimodal isolated sign language recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops, pages 5046–5053, 2025.

  23. [23]

    Rethinking the inception architecture for computer vision

    Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 2818–2826. IEEE/CVF, 2016.

  24. [24]

    EfficientNetV2: Smaller models and faster training

    Mingxing Tan and Quoc V. Le. EfficientNetV2: Smaller models and faster training. In Proceedings of the 38th International Conference on Machine Learning (ICML), pages 10096–10106. PMLR, 2021.

  25. [25]

    Dynamic gesture recognition based on FMCW millimeter wave radar: Review of methodologies and results

    Gaopeng Tang, Tongning Wu, and Congsheng Li. Dynamic gesture recognition based on FMCW millimeter wave radar: Review of methodologies and results. Sensors, 23:7478, 2023.

  26. [26]

    Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results

    Antti Tarvainen and Harri Valpola. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. In Advances in Neural Information Processing Systems (NeurIPS), pages 1–10. Curran Associates, 2017.

  27. [27]

    A novel detection and recognition method for continuous hand gesture using FMCW radar

    Yong Wang, Aifeng Ren, Mu Zhou, Wei Wang, and Xiaodong Yang. A novel detection and recognition method for continuous hand gesture using FMCW radar. IEEE Access, 8:167264–167275, 2020.

  28. [28]

    PyTorch Image Models

    Ross Wightman. PyTorch Image Models. https://github.com/huggingface/pytorch-image-models, 2019.

  29. [29]

    CBAM: Convolutional block attention module

    Sanghyun Woo, Jongchan Park, Joon-Young Lee, and In So Kweon. CBAM: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV 2018), pages 3–19. Springer, 2018.

  30. [30]

    CutMix: Regularization strategy to train strong classifiers with localizable features

    Sangdoo Yun, Dongyoon Han, Seong Joon Oh, Sanghyuk Chun, Junsuk Choe, and Youngjoon Yoo. CutMix: Regularization strategy to train strong classifiers with localizable features. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV 2019), pages 6023–6032. IEEE/CVF, 2019.

  31. [31]

    mixup: Beyond Empirical Risk Minimization

    Hongyi Zhang, Moustapha Cisse, Yann N Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. arXiv:1710.09412, 2017.