State Space Models are Effective Sign Language Learners: Exploiting Phonological Compositionality for Vocabulary-Scale Recognition

Austin Jin; Bryan Cheng; Jasper Zhang

arXiv: 2604.08761 · v1 · submitted 2026-04-09 · 💻 cs.CV

Pith reviewed 2026-05-10 17:08 UTC · model grok-4.3

classification 💻 cs.CV

keywords sign language recognition · state space models · phonological compositionality · skeleton data · ASL · vocabulary scaling · graph attention · prototypical classification

The pith

Phonological factorization in a state space model reaches 72.1 percent accuracy on WLASL2000 from skeleton data alone, in a study that also assembles a 5,565-sign ASL vocabulary.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Sign language recognition collapses on large vocabularies because models treat each sign as an indivisible visual pattern instead of a reusable combination of handshape, location, movement, and orientation. The paper introduces PHONSSM, a state space model that enforces phonological decomposition by applying anatomically grounded graph attention to skeleton sequences, factoring the resulting representations into orthogonal subspaces, and using prototypical classification. The authors assemble what they describe as the largest ASL dataset to date (5,565 signs) and report 72.1 percent accuracy on WLASL2000 using only skeleton data, an 18.4-point gain over prior skeleton methods and better than most RGB baselines. Gains are largest in the few-shot setting (+225 percent relative), and the model transfers zero-shot to ASL Citizen, exceeding supervised RGB baselines. The central argument is that the scaling failure is a representation problem solved by building linguistic compositionality directly into the architecture.

Core claim

The vocabulary scaling bottleneck in sign language recognition is fundamentally a representation learning problem that is solved by compositional inductive biases mirroring the phonological structure of signs. By using anatomically-grounded graph attention on skeleton data to factor sequences into orthogonal subspaces corresponding to phonological parameters and then performing prototypical classification, the state space model achieves 72.1 percent accuracy on WLASL2000 within a study spanning 5,565 signs, with its largest relative gains in the few-shot regime (+225 percent) and zero-shot transfer to a new dataset (ASL Citizen).

What carries the argument

PHONSSM, which applies anatomically-grounded graph attention to skeleton sequences, explicitly factors the representations into orthogonal subspaces for phonological parameters, and performs prototypical classification within a state space model.
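To make the prototypical-classification step concrete, here is a minimal sketch that assumes the common prototypical-network recipe: each class prototype is the mean of its support embeddings, and a query is assigned to the nearest prototype by squared Euclidean distance. The function names, the toy 2-d embeddings, and the distance choice are illustrative assumptions; the paper's hierarchical prototype classifier (HPC) may differ.

```python
import numpy as np

def build_prototypes(embeddings, labels):
    """Return (classes, prototypes): one mean embedding per class."""
    classes = np.unique(labels)
    prototypes = np.stack([embeddings[labels == c].mean(axis=0) for c in classes])
    return classes, prototypes

def classify(queries, classes, prototypes):
    """Assign each query to the class of its nearest prototype."""
    # dists[q, k] = squared Euclidean distance from query q to prototype k
    dists = ((queries[:, None, :] - prototypes[None, :, :]) ** 2).sum(axis=-1)
    return classes[dists.argmin(axis=1)]

# Toy example: two hypothetical signs with two support samples each
support = np.array([[0.0, 0.0], [0.2, 0.0], [1.0, 1.0], [1.0, 0.8]])
support_labels = np.array([0, 0, 1, 1])
classes, prototypes = build_prototypes(support, support_labels)
predictions = classify(np.array([[0.1, 0.1], [0.9, 1.0]]), classes, prototypes)
```

In a scheme like this, adding a new sign only requires averaging a handful of embeddings into a new prototype, which is one reason prototype-based classifiers are a natural fit for few-shot settings.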

If this is right

Large-vocabulary sign recognition becomes practical using only low-cost skeleton input instead of video.
Few-shot performance improves substantially because new signs can be composed from existing phonological subspaces.
Zero-shot transfer across datasets becomes possible by relying on shared phonological structure rather than dataset-specific visual patterns.
The same compositional bias should reduce the need for massive labeled video data in other structured sequence recognition tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The result suggests that explicit factorization of compositional rules may help other domains where flat representations fail to scale, such as gesture or action recognition.
Sufficiency of skeleton data implies that fine-grained appearance details are secondary to motion structure for distinguishing signs.
Extending the orthogonal subspace approach to continuous signing or to other sign languages could test whether the phonological decomposition generalizes beyond isolated ASL signs.

Load-bearing premise

The phonological parameters of signs can be captured as orthogonal subspaces in the model's representation space through graph attention on skeleton joints.

What would settle it

A state space model without the phonological factorization or orthogonal subspace constraint achieves comparable accuracy on the 5,565-sign dataset, or the learned subspaces fail to correspond to handshape, location, movement, and orientation when examined.

Figures

Figures reproduced from arXiv: 2604.08761 by Austin Jin, Bryan Cheng, Jasper Zhang.

**Figure 1.** PhonSSM architecture. Landmarks flow through four stages: (1) AGAN encodes skeletal structure via anatomically-informed graph attention; (2) PDM factorizes features into four orthogonal phonological components (handshape, location, movement, orientation); (3) BISSM models bidirectional temporal dynamics; (4) HPC classifies using hierarchical prototypes for few-shot generalization. Input is T × N × 3 where …
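As a rough illustration of what "bidirectional temporal dynamics" in stage (3) can mean, the sketch below runs a diagonal linear state space recurrence, h_t = a * h_{t-1} + b * x_t, over a sequence in both directions and concatenates the two state sequences. This is a generic bidirectional scan with assumed per-channel parameters `a` and `b`; it is not the paper's BISSM, whose parameterization is not given in the caption.

```python
import numpy as np

def linear_scan(x, a, b):
    """Diagonal linear SSM: h_t = a * h_{t-1} + b * x_t, starting from h = 0.

    x: (T, D) input sequence; a, b: (D,) per-channel coefficients (assumed).
    """
    h = np.zeros(x.shape[1])
    states = np.empty_like(x, dtype=float)
    for t in range(x.shape[0]):
        h = a * h + b * x[t]
        states[t] = h
    return states

def bidirectional_scan(x, a, b):
    """Concatenate a left-to-right scan with a right-to-left scan: (T, 2D)."""
    forward = linear_scan(x, a, b)
    backward = linear_scan(x[::-1], a, b)[::-1]  # scan reversed input, then re-align in time
    return np.concatenate([forward, backward], axis=1)

x = np.array([[1.0], [0.0], [0.0]])  # impulse at t = 0
out = bidirectional_scan(x, a=np.array([0.5]), b=np.array([1.0]))
```

With the impulse input, the forward channel decays after t = 0 while the backward channel is non-zero only at t = 0, showing that each time step sees a summary of its past in one half and of its future in the other.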

**Figure 2.** Main results. (a,b,d) WLASL evaluation using pose+hands input (75 landmarks). (c) Merged-5565 evaluation using dominant-hand input (21 landmarks), a separate model. Specifically: (a) WLASL-2000: 72.1% vs baselines. (b) Vocabulary scaling (separate models per split). (c) Few-shot accuracy by training samples; gains largest for rare signs. (d) Ablation: PDM removal causes largest drop (−11.9pp).

**Figure 3.** Component factorization. Cosine similarity matrix.

Phonological decomposition enables few-shot learning (Table 2): for signs with 1–5 samples, 13.27% vs 4.08% (+225%). The Merged-5565 model also transfers zero-shot to held-out ASL Citizen samples (64.1% on overlapping vocabulary), compared to the RGB-based baseline of 63.2% reported in Desai et al. (2023), which requires full supervision.

4.5 ANALYSIS Com…

**Figure 4.** Phonological analysis. (a) Minimal pair error rates by component type. (b) Per-component accuracy across datasets. (c) Error rate by training frequency: gains are largest for rare signs.

Original abstract

Sign language recognition suffers from catastrophic scaling failure: models achieving high accuracy on small vocabularies collapse at realistic sizes. Existing architectures treat signs as atomic visual patterns, learning flat representations that cannot exploit the compositional structure of sign languages: systematically organized from discrete phonological parameters (handshape, location, movement, orientation) reused across the vocabulary. We introduce PHONSSM, enforcing phonological decomposition through anatomically-grounded graph attention, explicit factorization into orthogonal subspaces, and prototypical classification enabling few-shot transfer. Using skeleton data alone on the largest ASL dataset ever assembled (5,565 signs), PHONSSM achieves 72.1% on WLASL2000 (+18.4pp over skeleton SOTA), surpassing most RGB methods without video input. Gains are most dramatic in the few-shot regime (+225% relative), and the model transfers zero-shot to ASL Citizen, exceeding supervised RGB baselines. The vocabulary scaling bottleneck is fundamentally a representation learning problem, solvable through compositional inductive biases mirroring linguistic structure.
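The +225% few-shot figure is a relative gain, not a difference in percentage points. A small check, using only the two few-shot accuracies quoted in the Figure 3 text (13.27% vs 4.08% for signs with 1–5 samples), makes the distinction explicit. The helper name is illustrative.

```python
def relative_gain_percent(new, baseline):
    """Relative improvement of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0

# Accuracies for signs with 1-5 training samples, as quoted in the Figure 3 text
few_shot_gain = relative_gain_percent(13.27, 4.08)  # about 225.2 (percent, relative)
absolute_gain_pp = 13.27 - 4.08                     # 9.19 percentage points
```

The same pair of numbers is therefore a 9.19-point absolute improvement, which is the scale on which the +18.4pp WLASL2000 gain is reported.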

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

PHONSSM gets strong scaling numbers on skeleton-based sign recognition by factoring phonological subspaces, but the headline gains may trace more to the new large dataset than to the inductive biases.

The paper introduces PHONSSM, a state space model that applies anatomically grounded graph attention to skeleton sequences, factors the representations into orthogonal subspaces for handshape, location, movement, and orientation, and uses prototypical classification for recognition. Alongside a newly assembled 5,565-sign ASL dataset, it reports 72.1% on WLASL2000, an 18.4-point lift over prior skeleton methods and competitive with many RGB approaches, with especially large relative gains in the few-shot regime and zero-shot transfer to ASL Citizen.

Referee Report

1 major / 3 minor

Summary. The paper claims that sign language recognition exhibits catastrophic scaling failure due to atomic visual pattern treatment in existing models, and introduces PHONSSM to enforce phonological compositionality via anatomically-grounded graph attention on skeleton data, explicit orthogonal subspace factorization, and prototypical classification. Using skeleton input alone on a newly assembled 5,565-sign ASL dataset, PHONSSM achieves 72.1% accuracy on WLASL2000 (+18.4pp over prior skeleton SOTA), surpasses most RGB methods, and shows strong few-shot (+225% relative) and zero-shot transfer to ASL Citizen, attributing success to compositional inductive biases mirroring linguistic structure.

Significance. If the results hold after addressing potential confounds, this would be a meaningful contribution to sign language recognition by demonstrating that skeleton-based models can scale to large vocabularies and compete with video methods through linguistic phonological structure. The assembly of the largest ASL dataset to date is a clear strength and community resource. The reported few-shot and zero-shot gains, if reproducible, would support practical utility in low-data regimes and highlight the value of compositional biases over flat representations.

major comments (1)

[Abstract] The central claim attributes the +18.4pp gain on WLASL2000 and scaling success to the phonological mechanisms (anatomically-grounded graph attention, orthogonal subspace factorization, prototypical classification). However, no ablation is described that compares PHONSSM to a non-compositional baseline trained on the identical 5,565-sign dataset; prior skeleton SOTA used smaller subsets, so the performance delta may be driven by data volume rather than the inductive biases.

minor comments (3)

The abstract provides no details on training procedure, exact baseline implementations and hyperparameters, error bars, statistical significance, or data splits, preventing verification of the numerical claims.
The title emphasizes state space models, but the abstract does not clarify how SSM components integrate with the graph attention and subspace factorization; explicit description of this architecture would aid clarity.
Additional information on the 5,565-sign dataset construction, including sources, collection/annotation methodology, and overlap with WLASL splits, would strengthen reproducibility.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and the opportunity to clarify the attribution of our results. We address the major comment point by point below.

Referee: [Abstract] The central claim attributes the +18.4pp gain on WLASL2000 and scaling success to the phonological mechanisms (anatomically-grounded graph attention, orthogonal subspace factorization, prototypical classification). However, no ablation is described that compares PHONSSM to a non-compositional baseline trained on the identical 5,565-sign dataset; prior skeleton SOTA used smaller subsets, so the performance delta may be driven by data volume rather than the inductive biases.

Authors: We agree that the abstract does not present a direct ablation of PHONSSM against a non-compositional baseline trained on the full 5,565-sign dataset, and that this leaves open the possibility that data scale contributes to the observed gains relative to prior skeleton methods. Our reported comparisons follow the standard WLASL2000 protocol against the strongest published skeleton baselines, which were trained on smaller subsets; the new dataset is a core contribution that enables vocabulary-scale evaluation. To isolate the role of the phonological inductive biases, we will add an ablation in the revised manuscript that trains a plain state-space-model baseline (without anatomically-grounded graph attention, orthogonal subspace factorization, or prototypical classification) on the identical 5,565-sign training set and reports its accuracy on WLASL2000. This will allow readers to quantify how much of the +18.4pp improvement is attributable to the compositional mechanisms versus data volume alone.

Revision: yes

Circularity Check

0 steps flagged

No circularity in derivation chain; claims are empirical

The paper introduces PHONSSM as an architecture with anatomically-grounded graph attention, subspace factorization, and prototypical classification, then reports empirical accuracies (72.1% on WLASL2000) and relative gains on a newly assembled 5,565-sign dataset. No equations, first-principles derivations, or predictions are presented that reduce the reported metrics to quantities defined by the model itself or to fitted parameters. The phonological inductive biases are architectural choices validated through standard train/test comparisons rather than self-referential loops. No self-citation chains or uniqueness theorems are invoked as load-bearing premises. The derivation chain is therefore self-contained empirical evaluation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim depends on the domain assumption that sign language phonology provides reusable discrete parameters that can be isolated via graph attention and orthogonal factorization; no free parameters or new physical entities are introduced beyond the model itself.

axioms (1)

Domain assumption: Sign language signs are composed from a small set of reusable phonological parameters (handshape, location, movement, orientation) that can be modeled as orthogonal subspaces.
Invoked to justify the explicit factorization and graph attention design.

invented entities (1)

PHONSSM (no independent evidence)
purpose: State space model architecture that enforces phonological decomposition for sign language recognition.
New model name and design introduced to address scaling failure.

pith-pipeline@v0.9.0 · 5471 in / 1450 out tokens · 49168 ms · 2026-05-10T17:08:54.162896+00:00 · methodology

Reference graph

Works this paper leans on

13 extracted references · 13 canonical work pages

[1]

For $D_c \ge 4$, the global minimum $\mathcal{L}_{\mathrm{ortho}} = 0$ is achievable

[2]

(1) By Lemma B.6, each term $\cos^2(c^{(i)}, c^{(j)}) \ge 0$

The gradient with respect to $c^{(k)}$ is:

$$\nabla_{c^{(k)}} \mathcal{L}_{\mathrm{ortho}} = \sum_{j \ne k} \frac{2\cos(c^{(k)}, c^{(j)})}{\|c^{(k)}\|\,\|c^{(j)}\|} \left( c^{(j)} - \cos(c^{(k)}, c^{(j)}) \frac{\|c^{(j)}\|}{\|c^{(k)}\|}\, c^{(k)} \right) \tag{6}$$

Proof. (1) By Lemma B.6, each term $\cos^2(c^{(i)}, c^{(j)}) \ge 0$. The sum equals zero iff every term equals zero, i.e., iff all pairs are orthogonal. (2) In $\mathbb{R}^{D_c}$ with $D_c \ge 4$, we can always find four mutually orthogonal vectors (e....
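The excerpt above can be checked numerically. The sketch below assumes the orthogonality loss is the sum of squared cosine similarities over unordered pairs of component vectors. That pairing convention is an assumption, chosen because it matches the factor of 2 in the gradient. The code compares the analytic gradient with central finite differences and confirms that four standard basis vectors in $\mathbb{R}^4$ reach zero loss. All function names are illustrative.

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def ortho_loss(components):
    """Sum of squared cosine similarities over unordered pairs (assumed form)."""
    n = len(components)
    return sum(cosine(components[i], components[j]) ** 2
               for i in range(n) for j in range(i + 1, n))

def ortho_grad(components, k):
    """Analytic gradient of ortho_loss with respect to components[k]."""
    ck = components[k]
    nk = np.linalg.norm(ck)
    grad = np.zeros_like(ck)
    for j, cj in enumerate(components):
        if j == k:
            continue
        nj = np.linalg.norm(cj)
        cos = cosine(ck, cj)
        grad += 2.0 * cos / (nk * nj) * (cj - cos * (nj / nk) * ck)
    return grad

def numeric_grad(components, k, eps=1e-6):
    """Central finite differences, used only to check ortho_grad."""
    grad = np.zeros_like(components[k])
    for d in range(grad.size):
        plus = [c.copy() for c in components]
        minus = [c.copy() for c in components]
        plus[k][d] += eps
        minus[k][d] -= eps
        grad[d] = (ortho_loss(plus) - ortho_loss(minus)) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
random_components = [rng.normal(size=4) for _ in range(4)]
basis_components = [np.eye(4)[i] for i in range(4)]  # four mutually orthogonal vectors in R^4
```

Calling `ortho_grad(random_components, 1)` and `numeric_grad(random_components, 1)` should agree to within finite-difference error, and `ortho_loss(basis_components)` evaluates to zero, which illustrates the claim that the minimum is attainable once the component dimension is at least four.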

[3]

Centered by subtracting wrist position

[4]

Normalized to unit scale based on palm size

[5]

Resampled to 30 frames using linear interpolation

[6]

“mother”/“father”

Augmented during training with random temporal shifts (±3 frames) and scale jitter (±10%)

E ADDITIONAL RESULTS

E.1 PER-CLASS ANALYSIS

Table 6: Performance breakdown by phonological characteristics on WLASL100. ∆: improvement over Bi-LSTM.

| Category | # Signs | Bi-LSTM | PHONSSM | ∆ |
|---|---|---|---|---|
| One-handed signs | 62 | 71.2 | 89.4 | +18.2 |
| Two-handed signs | 38 | 68.9 | 86.8 | +17.9 |

Static (no movemen...
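The preprocessing steps listed in entries [3] to [6] (wrist centering, palm-size normalization, resampling to 30 frames, and training-time augmentation) can be sketched as follows. Several details are assumptions not stated in the excerpts: the landmark indices follow a MediaPipe-style hand layout, palm size is taken as the mean wrist-to-middle-knuckle distance over the clip, and temporal shifts repeat edge frames. Function names are illustrative.

```python
import numpy as np

WRIST, MIDDLE_MCP = 0, 9  # assumed landmark indices (MediaPipe-style hand layout)

def center_and_scale(seq):
    """seq: (T, N, 3). Subtract the wrist per frame, then divide by palm size."""
    centered = seq - seq[:, WRIST:WRIST + 1, :]
    # Assumption: palm size = mean wrist-to-middle-knuckle distance over the clip
    palm = np.linalg.norm(centered[:, MIDDLE_MCP, :], axis=-1).mean()
    return centered / max(palm, 1e-8)

def resample(seq, num_frames=30):
    """Linearly interpolate every coordinate track to num_frames frames."""
    T = seq.shape[0]
    src = np.linspace(0.0, 1.0, T)
    dst = np.linspace(0.0, 1.0, num_frames)
    flat = seq.reshape(T, -1)
    out = np.stack([np.interp(dst, src, flat[:, d]) for d in range(flat.shape[1])], axis=1)
    return out.reshape(num_frames, *seq.shape[1:])

def augment(seq, rng, max_shift=3, scale_jitter=0.10):
    """Random temporal shift (edge frames repeated, an assumption) and scale jitter."""
    shift = int(rng.integers(-max_shift, max_shift + 1))
    idx = np.clip(np.arange(seq.shape[0]) + shift, 0, seq.shape[0] - 1)
    scale = 1.0 + rng.uniform(-scale_jitter, scale_jitter)
    return seq[idx] * scale

rng = np.random.default_rng(0)
raw = rng.normal(size=(47, 21, 3))       # synthetic clip: 47 frames, 21 hand landmarks
normalized = center_and_scale(raw)
clip = resample(normalized)              # step order follows the listed entries
train_clip = augment(clip, rng)
```

Centering before resampling keeps the wrist at the origin in every output frame, since linear interpolation of zeros stays zero.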

[7]

Extract the time-averaged component embedding $\bar{c}^{(i)} \in \mathbb{R}^{32}$

[8]

Train a linear classifier (logistic regression) to predict each phonological category

[9]

Each row shows one PDM branch; each column shows one phonological prediction task

Report accuracy on held-out samples (5-fold cross-validation)

G.3 FULL RESULTS

Table 11: Complete linear probe results. Each row shows one PDM branch; each column shows one phonological prediction task. Diagonal entries (bold) indicate intended correspondences.

| PDM Branch | Handshape | Location | Movement | Orientation |
|---|---|---|---|---|
| Handshape | **78.4** | 31.2 | 24.8 | 38.1 |

...
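The probe procedure in entries [7] to [9] (time-averaged 32-dimensional branch embedding, a logistic-regression probe, accuracy under 5-fold cross-validation) can be sketched as below. To stay self-contained, the sketch trains multinomial logistic regression with plain gradient descent in NumPy rather than calling a library. The data are synthetic stand-ins, and the learning rate, step count, and function names are assumptions.

```python
import numpy as np

def fit_logreg(X, y, num_classes, lr=0.1, steps=300):
    """Multinomial logistic regression trained with plain gradient descent."""
    W = np.zeros((X.shape[1], num_classes))
    b = np.zeros(num_classes)
    Y = np.eye(num_classes)[y]                       # one-hot targets
    for _ in range(steps):
        logits = X @ W + b
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        P = np.exp(logits)
        P /= P.sum(axis=1, keepdims=True)
        G = (P - Y) / len(X)                         # gradient of mean cross-entropy w.r.t. logits
        W -= lr * (X.T @ G)
        b -= lr * G.sum(axis=0)
    return W, b

def probe_accuracy(X, y, num_classes, folds=5, seed=0):
    """Mean held-out accuracy of the linear probe under k-fold cross-validation."""
    order = np.random.default_rng(seed).permutation(len(X))
    scores = []
    for held_out in np.array_split(order, folds):
        train = np.setdiff1d(order, held_out)
        W, b = fit_logreg(X[train], y[train], num_classes)
        scores.append(np.mean((X[held_out] @ W + b).argmax(axis=1) == y[held_out]))
    return float(np.mean(scores))

# Synthetic stand-in for time-averaged 32-d branch embeddings with 3 handshape classes
rng = np.random.default_rng(0)
y = rng.integers(0, 3, size=150)
centers = rng.normal(scale=3.0, size=(3, 32))
X = centers[y] + rng.normal(size=(150, 32))
acc = probe_accuracy(X, y, num_classes=3)
```

Running the same probe once per (branch, target) pair would fill a grid shaped like Table 11, with the diagonal corresponding to each branch's intended phonological parameter.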

[10]

Specialization: Each branch achieves highest accuracy on its intended category (diagonal), confirming semantic correspondence

[11]

This is expected since components are correlated in natural signs (e.g., certain handshapes occur more often at certain locations)

Above-chance cross-prediction: Off-diagonal entries exceed chance, indicating some phonological information leaks across branches. This is expected since components are correlated in natural signs (e.g., certain handshapes occur more often at certain locations)

[12]

Factorization benefit: The gap between diagonal and off-diagonal (e.g., 78.4 vs 31.2 for handshape) demonstrates effective factorization

[13]

Full embedding

Factorization-accuracy trade-off: The “Full embedding” row shows slightly higher accuracy than individual branches (e.g., 81.2 vs 78.4 for handshape), indicating ∼3pp is sacrificed for factorization. We experimented with relaxing $\lambda_{\mathrm{ortho}}$ from 0.1 to 0.05: component probe accuracy improved by ∼2pp but sign-level accuracy dropped by 1.5pp due to increased r...

