pith. machine review for the scientific record.

arxiv: 2605.08663 · v1 · submitted 2026-05-09 · 💻 cs.CV

Recognition: 2 theorem links · Lean Theorem

CAST: Channel-Aware Spatial Transfer Learning with Pseudo-Image Radar for Sign Language Recognition

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 01:30 UTC · model grok-4.3

classification 💻 cs.CV
keywords sign language recognition · radar sensing · range-time maps · cadence velocity diagrams · transfer learning · attention mechanism · dual-stream model · gesture recognition

The pith

A dual-stream radar architecture using physics-aware processing achieves 80.5% accuracy for isolated sign language recognition.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes CAST, a dual-stream architecture for recognizing isolated sign language gestures from magnitude-only 60 GHz radar Range-Time Maps. It combines an explicit decibel-to-linear inversion and windowed FFT to create clean Cadence Velocity Diagrams, a cross-antenna spatial attention module that operates before convolution, and an asymmetric cross-attention fusion between two pretrained vision backbones. Together these components deliver 80.5% Top-1 accuracy under 5-fold cross-validation, 3.3 percentage points higher than the best single-model baseline (77.2%). This matters because it demonstrates a way to adapt vision models to radar data while respecting the underlying physics, potentially enabling more robust and privacy-friendly gesture recognition systems.
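The asymmetric cross-attention fusion can be pictured as one stream's tokens querying the other's. Below is a minimal single-head sketch in NumPy; the direction (CVD tokens attending into RTM tokens) and the token shapes are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(queries, keys_values):
    """Single-head cross-attention: one stream supplies the queries,
    the other supplies both keys and values, which is what makes the
    fusion asymmetric."""
    d = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d)   # (n_q, n_kv)
    return softmax(scores, axis=-1) @ keys_values   # (n_q, d)

# Hypothetical token shapes: 3 CVD tokens query 5 RTM tokens.
cvd_tokens = np.random.default_rng(0).normal(size=(3, 16))
rtm_tokens = np.random.default_rng(1).normal(size=(5, 16))
fused = cross_attend(cvd_tokens, rtm_tokens)
```

Each fused row is a convex combination of the RTM tokens, so the query stream selectively reads the other modality rather than symmetrically mixing with it.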

Core claim

CAST integrates three physics-aware elements: an inversion from decibel to linear scale followed by a windowed fast Fourier transform to extract Cadence Velocity Diagrams without harmonic artifacts, a cross-antenna spatial attention applied to raw antenna channels, and asymmetric cross-attention that fuses representations from a ConvNeXt-Tiny backbone on the velocity diagrams with an EfficientNetV2-S backbone on the range-time maps. This dual-stream setup yields a Top-1 accuracy of 80.5% on a dataset of clinical and alphabetical gestures under 5-fold cross-validation, outperforming the strongest single-model baseline (77.2%) by 3.3 percentage points.
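The first component can be sketched end to end. The paper leaves the window type, length, overlap, and dB convention unspecified (the referee's third minor comment asks for exactly this), so the Hann window over the full slow-time axis and the amplitude (/20) inversion below are assumptions:

```python
import numpy as np

def cvd_from_rtm(rtm_db):
    """Sketch of the CVD extraction: explicit dB-to-linear inversion,
    then a windowed FFT along slow time for each range bin.

    rtm_db: (range_bins, time_steps) Range-Time Map in decibels.
    Returns a (range_bins, time_steps) magnitude spectrum over
    cadence frequency per range bin.
    """
    # 1. Invert the log compression first: taking the FFT of a
    #    dB-scale envelope spreads energy into spurious harmonics.
    #    The /20 (amplitude-dB) convention is an assumption.
    rtm_lin = 10.0 ** (rtm_db / 20.0)

    # 2. Windowed FFT along the slow-time axis. Window choice is an
    #    assumption; the paper does not specify type/length/overlap.
    win = np.hanning(rtm_lin.shape[1])
    spectrum = np.fft.fft(rtm_lin * win[None, :], axis=1)
    return np.abs(spectrum)
```

The ordering is the point: inverting to linear scale before the spectral analysis is what avoids the harmonic artifacts the paper attributes to FFTs of log-compressed signals.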

What carries the argument

The CAST dual-stream architecture with physics-aware pseudo-image radar processing, cross-antenna spatial attention, and asymmetric cross-attention fusion between CVD and RTM streams.

Load-bearing premise

The accuracy improvement results specifically from the physics-aware inversion, cross-antenna attention, and asymmetric fusion rather than from differences in training procedures or model capacities.

What would settle it

Re-evaluate the single-model baselines using the exact same training protocol, data augmentation, and cross-validation splits as the proposed CAST model to check if the 3.3% gap remains.
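A matched re-evaluation would yield one accuracy per fold for each model, which a paired test can compare directly. A minimal sketch; any fold accuracies fed in would come from the matched re-runs, since the paper's per-fold numbers are not reported:

```python
import math

def paired_t(acc_a, acc_b):
    """Paired t statistic over per-fold accuracies from k-fold CV.

    acc_a, acc_b: same-length lists of fold accuracies for the two
    models evaluated on identical folds. Standard caveat: CV folds
    are not independent, so the nominal t distribution is optimistic
    (see Dietterich 1998; Nadeau & Bengio 2003).
    """
    diffs = [a - b for a, b in zip(acc_a, acc_b)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    return mean / math.sqrt(var / n)
```

With five matched folds this gives a single statistic to compare against a t distribution with 4 degrees of freedom, turning the 3.3-point gap into a testable claim rather than a point estimate.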

Figures

Figures reproduced from arXiv: 2605.08663 by Fakhri Karray, Hamdi Altaheri, Md. Milon Islam, Md Rabiul Islam, Md Rezwanul Haque, Md. Shakhoyat Rahman Shujon, Sheikh Md. Galib Mahim.

Figure 1. Overall architecture of the proposed CAST model. Three-receiver RTMs are processed through per-stream CASA modules…
Figure 2. The CASA module. Per-antenna features are globally…
Figure 3. Most-confused pair: 67 N→56 M (7 errors). (a) RTM of a misclassified sample (true label: 67 N). (b) CVD of the same sample. (c) RTM of a correctly classified 56 M sample. (d) CVD of the same. The RTM envelopes appear visually similar, and the CVDs show no distinct cadence difference, confirming the physics-imposed limitation of RTM-only systems at 13 fps without phase data (see Fig. S2 for all confused pairs).
read the original abstract

We propose CAST, a dual-stream architecture that utilizes channel-aware spatial transfer learning for isolated sign language recognition addressing the challenges of magnitude-only 60 GHz radar Range-Time Maps (RTM). The proposed framework combines three physics-aware architectures with pretrained vision backbones, which operate under radar-only constraints across clinical and alphabetical gestures. First, an explicit decibel-to-linear inversion is combined with a windowed fast Fourier transform that extracts Cadence Velocity Diagrams (CVD) while avoiding the harmonic artifacts that arise from the spectral analysis of log-compressed signals. Second, a cross-antenna spatial attention module applies attention to raw antenna channels before the convolution, preserving inter-receiver amplitude covariance. Third, an asymmetric cross-attention mechanism fuses representations from parallel ConvNeXt-Tiny (CVD) and EfficientNetV2-S (RTM) backbones. Extensive experiments reveal that the architecture achieves a Top-1 accuracy of 80.5% under 5-fold cross-validation, establishing a 3.3% improvement over the best single-model baseline (77.2%). The findings suggest that physics-aware signal representations form a promising direction for radar-only sign language recognition under constrained sensor modalities. The source code is available at: https://github.com/Shakhoyat/CAST-at-SignEval2026.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated authors' rebuttal, a circularity audit, and an axiom ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The manuscript proposes CAST, a dual-stream architecture for isolated sign language recognition from 60 GHz radar Range-Time Maps (RTM). It combines a dB-to-linear inversion with windowed FFT to extract Cadence Velocity Diagrams (CVD) avoiding harmonic artifacts, a cross-antenna spatial attention module applied before convolution, and asymmetric cross-attention fusion between a ConvNeXt-Tiny stream on CVD and an EfficientNetV2-S stream on RTM. Under 5-fold cross-validation the model reports 80.5% Top-1 accuracy, a 3.3% gain over the best single-model baseline (77.2%). Source code is released.

Significance. If the reported gain can be isolated to the physics-aware inversion, cross-antenna attention, and asymmetric fusion rather than unmatched training protocols or increased model capacity, the work would demonstrate a useful direction for radar-only sign language recognition. The explicit incorporation of radar signal properties into pretrained vision backbones and the public code release are strengths that would support follow-on research in constrained sensor modalities.

major comments (2)
  1. [Abstract and Results] Abstract and experimental results: the central claim of a 3.3% Top-1 accuracy improvement (80.5% vs. 77.2%) under 5-fold CV is presented without any dataset size, class count, statistical tests, error bars, or ablation tables. This prevents verification that the delta arises from the three proposed modules rather than baseline differences.
  2. [Experimental Setup] Experimental setup: no section confirms that the single-model baselines (ConvNeXt-Tiny and EfficientNetV2-S) received identical optimizer, learning-rate schedule, batch size, data augmentation, epoch count, or initialization as the full dual-stream CAST model. The dual-stream design also increases total capacity, so the reported gain cannot yet be attributed specifically to the dB-to-linear inversion, cross-antenna attention, or asymmetric fusion.
minor comments (3)
  1. The manuscript should explicitly state the dataset (number of samples, number of classes, train/test split details) and the exact gesture vocabulary used for the clinical and alphabetical signs.
  2. [Experiments] Add a component-wise ablation table (removing inversion, removing cross-antenna attention, removing asymmetric fusion) to quantify each module's contribution.
  3. [Methods] Clarify the precise implementation of the windowed FFT (window type, length, overlap) and the channel dimensions of the cross-antenna attention module.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which highlight important areas for improving the clarity and rigor of our experimental claims. We address each major comment point by point below and will revise the manuscript accordingly.

read point-by-point responses
  1. Referee: [Abstract and Results] Abstract and experimental results: the central claim of a 3.3% Top-1 accuracy improvement (80.5% vs. 77.2%) under 5-fold CV is presented without any dataset size, class count, statistical tests, error bars, or ablation tables. This prevents verification that the delta arises from the three proposed modules rather than baseline differences.

    Authors: We agree that the abstract and results presentation would benefit from these details to allow readers to assess the source of the reported improvement. In the revised manuscript, we will expand the abstract to include the dataset size and class count from the SignEval2026 benchmark. We will also add error bars to the reported accuracies, include statistical significance tests (e.g., paired t-tests or McNemar's test across the 5 folds), and provide ablation tables that systematically isolate the contributions of the dB-to-linear inversion, cross-antenna spatial attention, and asymmetric cross-attention fusion. These additions will help verify that the 3.3% gain is attributable to the proposed physics-aware components. revision: yes

  2. Referee: [Experimental Setup] Experimental setup: no section confirms that the single-model baselines (ConvNeXt-Tiny and EfficientNetV2-S) received identical optimizer, learning-rate schedule, batch size, data augmentation, epoch count, or initialization as the full dual-stream CAST model. The dual-stream design also increases total capacity, so the reported gain cannot yet be attributed specifically to the dB-to-linear inversion, cross-antenna attention, or asymmetric fusion.

    Authors: We acknowledge that the manuscript does not explicitly confirm identical training protocols for the baselines in a dedicated section. In the revision, we will add a table or subsection detailing all hyperparameters (optimizer, learning-rate schedule, batch size, data augmentation, epochs, and initialization) and state that they are shared across the single-stream baselines and the full CAST model. To address the capacity concern, we will include an additional ablation comparing the full dual-stream CAST against a dual-stream variant that uses simple feature concatenation (without the proposed attention mechanisms) while keeping total capacity matched. This will help isolate the specific contributions of the physics-aware inversion and attention modules. revision: yes
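The McNemar's test the rebuttal proposes needs only the two discordant counts on pooled held-out predictions. A sketch; the counts in any real use would come from the matched evaluation, and the decision threshold is the standard chi-square cutoff:

```python
def mcnemar_chi2(b, c):
    """McNemar's chi-square statistic with continuity correction.

    b: samples the baseline classifies correctly but CAST misses.
    c: samples CAST classifies correctly but the baseline misses.
    Under the null of equal accuracy the statistic is approximately
    chi-square with 1 degree of freedom, so values above 3.841
    reject at alpha = 0.05.
    """
    if b + c == 0:
        return 0.0
    return (abs(b - c) - 1) ** 2 / (b + c)
```

Because it conditions only on disagreements between the two models, McNemar's test is well suited to paired classifiers evaluated on the same held-out samples.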

Circularity Check

0 steps flagged

No circularity: empirical CV accuracy is measured on held-out folds, independent of architecture definitions

full rationale

The paper reports a measured Top-1 accuracy of 80.5% under 5-fold cross-validation on radar sign-language data, compared against single-model baselines using the same pretrained backbones. This is a direct empirical result on external held-out folds, not an equation or parameter that reduces to its own inputs by construction. No self-definitional loops, fitted inputs renamed as predictions, or load-bearing self-citations appear in the architecture description or results. The three physics-aware modules (dB-to-linear + windowed FFT, cross-antenna attention, asymmetric fusion) are design choices whose performance impact is tested via comparison, not presupposed. Per the hard rules, an empirical result on CV folds with no reduction to fitted parameters or self-citation chains receives score 0. Concerns about unmatched training procedures or capacity are correctness risks, not circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work rests on standard transfer-learning assumptions and signal-processing conventions without introducing new free parameters, axioms, or invented entities beyond the described network modules.

axioms (1)
  • domain assumption Pretrained vision backbones transfer effectively to radar-derived pseudo-images
    Invoked by the choice of ConvNeXt-Tiny and EfficientNetV2-S backbones operating on RTM and CVD inputs

pith-pipeline@v0.9.0 · 5574 in / 1158 out tokens · 47015 ms · 2026-05-12T01:30:30.905263+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

31 extracted references · 31 canonical work pages · 1 internal anchor

  1. [1]

    Multisource approaches to Italian sign language (LIS) recognition: Insights from the MultiMedaLIS dataset

    Gaia Caligiore, Raffaele Mineo, Concetto Spampinato, Egidio Ragonese, Simone Palazzo, and Sabina Fontana. Multisource approaches to Italian sign language (LIS) recognition: Insights from the MultiMedaLIS dataset. In Proceedings of the Tenth Italian Conference on Computational Linguistics (CLiC-it 2024), pages 132–140, Pisa, Italy, 2024. CEUR Workshop Pr...

  2. [2]

    CrossViT: Cross-attention multi-scale vision transformer for image classification

    Chun-Fu Chen, Quanfu Fan, and Rameswar Panda. CrossViT: Cross-attention multi-scale vision transformer for image classification. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV 2021), pages 357–

  3. [3]

    Approximate statistical tests for comparing supervised classification learning algorithms

    Thomas G. Dietterich. Approximate statistical tests for comparing supervised classification learning algorithms. Neural Computation, 10(7):1895–1923, 1998.

  4. [4]

    SignEval 2026 challenges results

    Ahmed Abul Hasanaath, Raffaele Mineo, Hamzah Luqman, Sarah Alyami, Maad Alowaifeer, Amelia Sorrenti, Gaia Caligiore, Sabina Fontana, Egidio Ragonese, Giovanni Bellitto, Federica Proietto Salanitri, Concetto Spampinato, Motaz Alfarraj, Mufti Mahmud, Simone Palazzo, and Nour Imane Zeghib. SignEval 2026 challenges results. In Proceedings of the IEEE/CVF C...

  5. [5]

    Squeeze-and-excitation networks

    Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 7132–7141. IEEE/CVF, 2018.

  6. [6]

    FusionEnsemble-Net: An attention-based ensemble of spatiotemporal networks for multimodal sign language recognition

    Md. Milon Islam and Md. Rezwanul Haque. FusionEnsemble-Net: An attention-based ensemble of spatiotemporal networks for multimodal sign language recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops (ICCVW 2025), MSLR Workshop, pages 4983–

  7. [7]

    IEEE/CVF, 2025. 2, 8

  8. [8]

    Averaging weights leads to wider optima and better generalization

    Pavel Izmailov, Dmitrii Podoprikhin, Timur Garipov, Dmitry Vetrov, and Andrew Gordon Wilson. Averaging weights leads to wider optima and better generalization. In Proceedings of the 34th Conference on Uncertainty in Artificial Intelligence (UAI 2018), pages 1–12. AUAI Press, 2018.

  9. [9]

    Rodar: Robust gesture recognition based on mmWave radar under human activity interference

    C. Jin, X. Meng, X. Li, J. Wang, M. Pan, et al. Rodar: Robust gesture recognition based on mmWave radar under human activity interference. IEEE Transactions on Mobile Computing, 23(12):11735–11749, 2024.

  10. [10]

    Multimodal Italian sign language recognition with radar-video late fusion

    Roman Juranek et al. Multimodal Italian sign language recognition with radar-video late fusion. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops (ICCVW 2025), MSLR Workshop, pages 5079–

  11. [11]

    IEEE/CVF, 2025. 1, 2

  12. [12]

    Human activity classification based on micro-Doppler signatures using a support vector machine

    Youngwook Kim and Hao Ling. Human activity classification based on micro-Doppler signatures using a support vector machine. IEEE Transactions on Geoscience and Remote Sensing, 47(5):1328–1337, 2009.

  13. [13]

    Soli: Ubiquitous gesture sensing with millimeter wave radar

    Jaime Lien, Nicholas Gillian, M. Emre Karagozler, Patrick Amihood, Carsten Schwesig, Erik Olson, Hakim Raja, and Ivan Poupyrev. Soli: Ubiquitous gesture sensing with millimeter wave radar. In ACM SIGGRAPH 2016 Papers, pages 1–19. ACM, 2016.

  14. [14]

    A ConvNet for the 2020s

    Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, and Saining Xie. A ConvNet for the 2020s. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 11976–11986. IEEE/CVF, 2022.

  15. [15]

    Decoupled weight decay regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations (ICLR), pages 1–18, 2019.

  16. [16]

    Sign language recognition for patient-doctor communication: A multimedia/multimodal dataset

    Raffaele Mineo, Gaia Caligiore, Concetto Spampinato, Sabina Fontana, Simone Palazzo, and Egidio Ragonese. Sign language recognition for patient-doctor communication: A multimedia/multimodal dataset. In Proceedings of the IEEE 8th Forum on Research and Technologies for Society and Industry Innovation (RTSI), pages 202–207. IEEE, 2024.

  17. [17]

    Text-aligned radar-based sign language recognition for healthcare communication

    Raffaele Mineo, Amelia Sorrenti, Gaia Caligiore, Federica Proietto Salanitri, Giovanni Bellitto, Senya Polikovsky, Sabina Fontana, Egidio Ragonese, Concetto Spampinato, and Simone Palazzo. Text-aligned radar-based sign language recognition for healthcare communication. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops (...

  18. [18]

    Radar-based imaging for sign language recognition in medical communication

    Raffaele Mineo, Amelia Sorrenti, Gaia Caligiore, Federica Proietto Salanitri, Giovanni Bellitto, Senya Polikovsky, Sabina Fontana, Egidio Ragonese, Concetto Spampinato, and Simone Palazzo. Radar-based imaging for sign language recognition in medical communication. In Proceedings of the 28th International Conference on Medical Image Computing and Computer...

  19. [19]

    A benchmark for radar-based Italian sign language recognition using frequency-domain range-time maps

    Raffaele Mineo, Amelia Sorrenti, Gaia Caligiore, Federica Proietto Salanitri, Giovanni Bellitto, Senya Polikovsky, Sabina Fontana, Egidio Ragonese, Concetto Spampinato, and Simone Palazzo. A benchmark for radar-based Italian sign language recognition using frequency-domain range-time maps. In Proceedings of the IEEE/CVF Conference on Computer Vision an...

  20. [20]

    Inference for the generalization error

    Claude Nadeau and Yoshua Bengio. Inference for the generalization error. Machine Learning, 52(3):239–281, 2003.

  21. [21]

    SpecAugment: A simple data augmentation method for automatic speech recognition

    Daniel S. Park, William Chan, Yu Zhang, Chung-Cheng Chiu, Barret Zoph, Ekin D. Cubuk, and Quoc V. Le. SpecAugment: A simple data augmentation method for automatic speech recognition. In Proceedings of Interspeech 2019, pages 2613–

  22. [22]

    Modality-specific benchmarks and radar range-doppler envelope classification for multimodal isolated sign language recognition

    Dmitriy Sazonov, Kamrul Islam, Evie Malaia, and Sevgi Gurbuz. Modality-specific benchmarks and radar range-doppler envelope classification for multimodal isolated sign language recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops, pages 5046–5053, 2025.

  23. [23]

    Rethinking the inception architecture for computer vision

    Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 2818–2826. IEEE/CVF, 2016.

  24. [24]

    EfficientNetV2: Smaller models and faster training

    Mingxing Tan and Quoc V. Le. EfficientNetV2: Smaller models and faster training. In Proceedings of the 38th International Conference on Machine Learning (ICML), pages 10096–10106. PMLR, 2021.

  25. [25]

    Dynamic gesture recognition based on FMCW millimeter wave radar: Review of methodologies and results

    Gaopeng Tang, Tongning Wu, and Congsheng Li. Dynamic gesture recognition based on FMCW millimeter wave radar: Review of methodologies and results. Sensors, 23:7478, 2023.

  26. [26]

    Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results

    Antti Tarvainen and Harri Valpola. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. In Advances in Neural Information Processing Systems (NeurIPS), pages 1–10. Curran Associates, 2017.

  27. [27]

    A novel detection and recognition method for continuous hand gesture using FMCW radar

    Yong Wang, Aifeng Ren, Mu Zhou, Wei Wang, and Xiaodong Yang. A novel detection and recognition method for continuous hand gesture using FMCW radar. IEEE Access, 8:167264–167275, 2020.

  28. [28]

    PyTorch Image Models

    Ross Wightman. PyTorch Image Models. https://github.com/huggingface/pytorch-image-models, 2019.

  29. [29]

    CBAM: Convolutional block attention module

    Sanghyun Woo, Jongchan Park, Joon-Young Lee, and In So Kweon. CBAM: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV 2018), pages 3–19. Springer, 2018.

  30. [30]

    CutMix: Regularization strategy to train strong classifiers with localizable features

    Sangdoo Yun, Dongyoon Han, Seong Joon Oh, Sanghyuk Chun, Junsuk Choe, and Youngjoon Yoo. CutMix: Regularization strategy to train strong classifiers with localizable features. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV 2019), pages 6023–6032. IEEE/CVF, 2019.

  31. [31]

    mixup: Beyond Empirical Risk Minimization

    Hongyi Zhang, Moustapha Cisse, Yann N Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. arXiv:1710.09412, 2017.