State Space Models are Effective Sign Language Learners: Exploiting Phonological Compositionality for Vocabulary-Scale Recognition
Pith reviewed 2026-05-10 17:08 UTC · model grok-4.3
The pith
Phonological factorization in a state space model scales skeleton-based sign recognition: trained on a 5,565-sign ASL dataset, PHONSSM reaches 72.1 percent on WLASL2000.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The vocabulary scaling bottleneck in sign language recognition is fundamentally a representation learning problem, solvable through compositional inductive biases that mirror the phonological structure of signs. PHONSSM applies anatomically-grounded graph attention to skeleton data, factors the resulting sequence representations into orthogonal subspaces corresponding to phonological parameters, and classifies against prototypes. Trained on a 5,565-sign ASL dataset, it reaches 72.1 percent on WLASL2000 (+18.4pp over the prior skeleton state of the art), with the largest gains in the few-shot regime (+225 percent relative) and zero-shot transfer to ASL Citizen.
What carries the argument
PHONSSM, which applies anatomically-grounded graph attention to skeleton sequences, explicitly factors the representations into orthogonal subspaces for phonological parameters, and performs prototypical classification within a state space model.
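To make the prototypical classification step concrete, here is a minimal sketch of nearest-prototype classification. It is not the paper's implementation: the Euclidean metric, the mean-pooled prototypes, the helper names (`build_prototypes`, `classify`), and the toy 2-D embeddings are all assumptions made for illustration.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def build_prototypes(support):
    """Map each sign label to the mean of its support embeddings."""
    return {label: mean_vector(embs) for label, embs in support.items()}


def classify(query, prototypes):
    """Return the label whose prototype is nearest to the query.

    Euclidean distance is an assumption; the source does not state the metric.
    """
    return min(prototypes, key=lambda label: math.dist(query, prototypes[label]))


# Toy 2-D embeddings standing in for encoder outputs (hypothetical values).
support = {
    "HELLO": [[1.0, 0.0], [0.8, 0.2]],
    "THANKS": [[0.0, 1.0], [0.2, 0.8]],
}
prototypes = build_prototypes(support)
prediction = classify([0.9, 0.1], prototypes)
```

Under this scheme a new sign only needs a handful of support embeddings to form a prototype, which is one plausible reading of why the few-shot regime benefits.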
If this is right
- Large-vocabulary sign recognition becomes practical using only low-cost skeleton input instead of video.
- Few-shot performance improves substantially because new signs can be composed from existing phonological subspaces.
- Zero-shot transfer across datasets becomes possible by relying on shared phonological structure rather than dataset-specific visual patterns.
- The same compositional bias should reduce the need for massive labeled video data in other structured sequence recognition tasks.
Where Pith is reading between the lines
- The result suggests that explicit factorization of compositional rules may help other domains where flat representations fail to scale, such as gesture or action recognition.
- Sufficiency of skeleton data implies that fine-grained appearance details are secondary to motion structure for distinguishing signs.
- Extending the orthogonal subspace approach to continuous signing or to other sign languages could test whether the phonological decomposition generalizes beyond isolated ASL signs.
Load-bearing premise
The phonological parameters of signs can be captured as orthogonal subspaces in the model's representation space through graph attention on skeleton joints.
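The appendix fragments quoted in the reference graph describe an orthogonality loss built from squared cosine terms between component embeddings, which reaches zero exactly when all pairs are orthogonal. A minimal sketch of such a penalty follows, assuming the sum runs over unordered pairs and that every component vector is nonzero; the function names are illustrative and not taken from the paper.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two nonzero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def ortho_loss(components):
    """Sum of squared cosines over all unordered pairs of components.

    Assumption: one vector per phonological branch (for example handshape,
    location, movement, orientation), all nonzero.
    """
    return sum(cosine(u, v) ** 2 for u, v in combinations(components, 2))


# Four mutually orthogonal vectors exist in R^4, so the loss can reach 0.
basis = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
loss_orthogonal = ortho_loss(basis)

# Four identical vectors are the worst case: all 6 pairs have cos^2 = 1.
loss_collapsed = ortho_loss([[1, 0, 0, 0]] * 4)
```

Because each term is a squared cosine, the penalty ignores vector length and only constrains directions, so it does not by itself guarantee that each direction carries the intended phonological parameter.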
What would settle it
A state space model without the phonological factorization or orthogonal subspace constraint, trained on the same 5,565-sign dataset, achieves comparable WLASL2000 accuracy, or the learned subspaces fail to correspond to handshape, location, movement, and orientation when examined.
Original abstract
Sign language recognition suffers from catastrophic scaling failure: models achieving high accuracy on small vocabularies collapse at realistic sizes. Existing architectures treat signs as atomic visual patterns, learning flat representations that cannot exploit the compositional structure of sign languages, which are systematically organized from discrete phonological parameters (handshape, location, movement, orientation) reused across the vocabulary. We introduce PHONSSM, enforcing phonological decomposition through anatomically-grounded graph attention, explicit factorization into orthogonal subspaces, and prototypical classification enabling few-shot transfer. Using skeleton data alone on the largest ASL dataset ever assembled (5,565 signs), PHONSSM achieves 72.1% on WLASL2000 (+18.4pp over skeleton SOTA), surpassing most RGB methods without video input. Gains are most dramatic in the few-shot regime (+225% relative), and the model transfers zero-shot to ASL Citizen, exceeding supervised RGB baselines. The vocabulary scaling bottleneck is fundamentally a representation learning problem, solvable through compositional inductive biases mirroring linguistic structure.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that sign language recognition fails catastrophically as vocabulary grows because existing models treat signs as atomic visual patterns. It introduces PHONSSM, which enforces phonological compositionality via anatomically-grounded graph attention on skeleton data, explicit orthogonal subspace factorization, and prototypical classification. Using skeleton input alone on a newly assembled 5,565-sign ASL dataset, PHONSSM achieves 72.1% accuracy on WLASL2000 (+18.4pp over prior skeleton SOTA), surpasses most RGB methods, and shows strong few-shot gains (+225% relative) and zero-shot transfer to ASL Citizen. The authors attribute this success to compositional inductive biases that mirror linguistic structure.
Significance. If the results hold after addressing potential confounds, this would be a meaningful contribution to sign language recognition by demonstrating that skeleton-based models can scale to large vocabularies and compete with video methods through linguistic phonological structure. The assembly of the largest ASL dataset to date is a clear strength and community resource. The reported few-shot and zero-shot gains, if reproducible, would support practical utility in low-data regimes and highlight the value of compositional biases over flat representations.
major comments (1)
- [Abstract] The central claim attributes the +18.4pp gain on WLASL2000 and scaling success to the phonological mechanisms (anatomically-grounded graph attention, orthogonal subspace factorization, prototypical classification). However, no ablation is described that compares PHONSSM to a non-compositional baseline trained on the identical 5,565-sign dataset; prior skeleton SOTA used smaller subsets, so the performance delta may be driven by data volume rather than the inductive biases.
minor comments (3)
- The abstract provides no details on training procedure, exact baseline implementations and hyperparameters, error bars, statistical significance, or data splits, preventing verification of the numerical claims.
- The title emphasizes state space models, but the abstract does not clarify how SSM components integrate with the graph attention and subspace factorization; explicit description of this architecture would aid clarity.
- Additional information on the 5,565-sign dataset construction, including sources, collection/annotation methodology, and overlap with WLASL splits, would strengthen reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and the opportunity to clarify the attribution of our results. We address the major comment point by point below.
- Referee: [Abstract] The central claim attributes the +18.4pp gain on WLASL2000 and scaling success to the phonological mechanisms (anatomically-grounded graph attention, orthogonal subspace factorization, prototypical classification). However, no ablation is described that compares PHONSSM to a non-compositional baseline trained on the identical 5,565-sign dataset; prior skeleton SOTA used smaller subsets, so the performance delta may be driven by data volume rather than the inductive biases.
  Authors: We agree that the abstract does not present a direct ablation of PHONSSM against a non-compositional baseline trained on the full 5,565-sign dataset, and that this leaves open the possibility that data scale contributes to the observed gains relative to prior skeleton methods. Our reported comparisons follow the standard WLASL2000 protocol against the strongest published skeleton baselines, which were trained on smaller subsets; the new dataset is a core contribution that enables vocabulary-scale evaluation. To isolate the role of the phonological inductive biases, we will add an ablation in the revised manuscript that trains a plain state-space-model baseline (without anatomically-grounded graph attention, orthogonal subspace factorization, or prototypical classification) on the identical 5,565-sign training set and reports its accuracy on WLASL2000. This will allow readers to quantify how much of the +18.4pp improvement is attributable to the compositional mechanisms versus data volume alone.
  Revision: yes
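The attribution the authors promise can be written as simple arithmetic once the plain-SSM ablation exists. The sketch below assumes a purely additive split of the total gain, which is a simplification, and the plain-SSM accuracy in it is a loudly hypothetical placeholder: no such number is reported anywhere in the source. Only 72.1 and +18.4pp come from the abstract.

```python
def implied_prior_sota(phonssm_acc, gain_pp):
    """Prior skeleton SOTA accuracy implied by the reported gain."""
    return phonssm_acc - gain_pp


def split_gain(phonssm_acc, prior_sota_acc, plain_ssm_acc):
    """Split the total gain (in percentage points) into two parts.

    data_pp: plain SSM on the large dataset vs. prior SOTA (data volume).
    bias_pp: PHONSSM vs. plain SSM on the same data (inductive biases).
    Assumes the two effects add, which real ablations may not support.
    """
    data_pp = plain_ssm_acc - prior_sota_acc
    bias_pp = phonssm_acc - plain_ssm_acc
    return data_pp, bias_pp


PHONSSM_ACC = 72.1  # reported in the abstract
GAIN_PP = 18.4      # reported in the abstract
prior = implied_prior_sota(PHONSSM_ACC, GAIN_PP)  # about 53.7

# HYPOTHETICAL placeholder, NOT a reported result: swap in the ablation
# accuracy once the revised manuscript provides it.
PLACEHOLDER_PLAIN_SSM_ACC = 60.0
data_pp, bias_pp = split_gain(PHONSSM_ACC, prior, PLACEHOLDER_PLAIN_SSM_ACC)
```

Whatever the real ablation number turns out to be, the two parts always sum to the reported +18.4pp, so the ablation only changes how that total is divided.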
Circularity Check
No circularity in derivation chain; claims are empirical
The paper introduces PHONSSM as an architecture with anatomically-grounded graph attention, subspace factorization, and prototypical classification, then reports empirical accuracies (72.1% on WLASL2000) and relative gains on a newly assembled 5,565-sign dataset. No equations, first-principles derivations, or predictions are presented that reduce the reported metrics to quantities defined by the model itself or to fitted parameters. The phonological inductive biases are architectural choices validated through standard train/test comparisons rather than self-referential loops. No self-citation chains or uniqueness theorems are invoked as load-bearing premises. The derivation chain is therefore self-contained empirical evaluation.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Sign language signs are composed from a small set of reusable phonological parameters (handshape, location, movement, orientation) that can be modeled as orthogonal subspaces.
invented entities (1)
- PHONSSM: no independent evidence
Reference graph
Works this paper leans on
- [1] For $D_c \geq 4$, the global minimum $\mathcal{L}_{\text{ortho}} = 0$ is achievable
- [2] (1) By Lemma B.6, each term $\cos^2(c^{(i)}, c^{(j)}) \geq 0$
  The gradient with respect to $c^{(k)}$ is:
  $$\nabla_{c^{(k)}} \mathcal{L}_{\text{ortho}} = \sum_{j \neq k} \frac{2\cos(c^{(k)}, c^{(j)})}{\|c^{(k)}\| \, \|c^{(j)}\|} \left( c^{(j)} - \cos(c^{(k)}, c^{(j)}) \, \frac{\|c^{(j)}\|}{\|c^{(k)}\|} \, c^{(k)} \right) \quad (6)$$
  Proof. (1) By Lemma B.6, each term $\cos^2(c^{(i)}, c^{(j)}) \geq 0$. The sum equals zero iff every term equals zero, i.e., iff all pairs are orthogonal. (2) In $\mathbb{R}^{D_c}$ with $D_c \geq 4$, we can always find four mutually orthogonal vectors (e....
- [3] Centered by subtracting wrist position
- [4] Normalized to unit scale based on palm size
- [5] Resampled to 30 frames using linear interpolation
- [6] Augmented during training with random temporal shifts (±3 frames) and scale jitter (±10%)
  E ADDITIONAL RESULTS
  E.1 PER-CLASS ANALYSIS
  Table 6: Performance breakdown by phonological characteristics on WLASL100. Δ: improvement over Bi-LSTM.
  Category            # Signs   Bi-LSTM   PHONSSM   Δ
  One-handed signs    62        71.2      89.4      +18.2
  Two-handed signs    38        68.9      86.8      +17.9
  Static (no movemen...
- [7] Extract the time-averaged component embedding $\bar{c}^{(i)} \in \mathbb{R}^{32}$
- [8] Train a linear classifier (logistic regression) to predict each phonological category
- [9] Each row shows one PDM branch; each column shows one phonological prediction task
  Report accuracy on held-out samples (5-fold cross-validation)
  G.3 FULL RESULTS
  Table 11: Complete linear probe results. Each row shows one PDM branch; each column shows one phonological prediction task. Diagonal entries (bold) indicate intended correspondences.
  PDM Branch    Prediction Target: Handshape   Location   Movement   Orientation
  Handshape     78.4                           31.2       24.8       38.1...
- [10] Specialization: Each branch achieves highest accuracy on its intended category (diagonal), confirming semantic correspondence
- [11] Above-chance cross-prediction: Off-diagonal entries exceed chance, indicating some phonological information leaks across branches. This is expected since components are correlated in natural signs (e.g., certain handshapes occur more often at certain locations)
- [12] Factorization benefit: The gap between diagonal and off-diagonal (e.g., 78.4 vs 31.2 for handshape) demonstrates effective factorization
- [13] Factorization-accuracy trade-off: The "Full embedding" row shows slightly higher accuracy than individual branches (e.g., 81.2 vs 78.4 for handshape), indicating ~3pp is sacrificed for factorization. We experimented with relaxing $\lambda_{\text{ortho}}$ from 0.1 to 0.05: component probe accuracy improved by ~2pp but sign-level accuracy dropped by 1.5pp due to increased r...
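Entries [3] to [6] above list four skeleton preprocessing steps: wrist centering, palm-size normalization, resampling to 30 frames by linear interpolation, and training-time augmentation. A rough sketch of how they might fit together is below. Several details are assumptions rather than facts from the paper: the joint indices (`WRIST`, `PALM_REF`, following a MediaPipe-style hand layout), per-frame rather than per-clip normalization, 2-D coordinates, and edge padding for the temporal shift.

```python
import math
import random

WRIST, PALM_REF = 0, 9  # assumed joint indices, not given in the source


def center_and_scale(frames):
    """Subtract the wrist from every joint, then divide by palm size."""
    out = []
    for joints in frames:
        wx, wy = joints[WRIST]
        # Fall back to 1.0 if the palm collapses to a point.
        palm = math.dist(joints[WRIST], joints[PALM_REF]) or 1.0
        out.append([((x - wx) / palm, (y - wy) / palm) for x, y in joints])
    return out


def resample(frames, target=30):
    """Linearly interpolate the sequence to `target` frames."""
    n = len(frames)
    if n == 1:
        return [list(frames[0]) for _ in range(target)]
    out = []
    for t in range(target):
        pos = t * (n - 1) / (target - 1)  # fractional source index
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        w = pos - lo
        out.append([
            (ax + w * (bx - ax), ay + w * (by - ay))
            for (ax, ay), (bx, by) in zip(frames[lo], frames[hi])
        ])
    return out


def augment(frames, rng, max_shift=3, jitter=0.10):
    """Random temporal shift (edge-padded) plus uniform scale jitter."""
    shift = rng.randint(-max_shift, max_shift)
    scale = 1.0 + rng.uniform(-jitter, jitter)
    n = len(frames)
    shifted = [frames[min(max(i + shift, 0), n - 1)] for i in range(n)]
    return [[(x * scale, y * scale) for x, y in joints] for joints in shifted]


# Synthetic clip: 12 frames, 10 joints, drifting to the right (made-up data).
clip = [[(float(j + f), float(j)) for j in range(10)] for f in range(12)]
prepared = resample(center_and_scale(clip))
augmented = augment(prepared, random.Random(0))
```

In this toy clip the whole hand translates without changing shape, so centering removes the drift entirely and every prepared frame is identical; real signing data would of course retain the articulation.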