arxiv: 2604.08761 · v1 · submitted 2026-04-09 · 💻 cs.CV

State Space Models are Effective Sign Language Learners: Exploiting Phonological Compositionality for Vocabulary-Scale Recognition

Pith reviewed 2026-05-10 17:08 UTC · model grok-4.3

classification 💻 cs.CV
keywords sign language recognition · state space models · phonological compositionality · skeleton data · ASL · vocabulary scaling · graph attention · prototypical classification

The pith

Phonological factorization in state space models lifts skeleton-based sign recognition to 72.1 percent accuracy on WLASL2000 and scales it to a merged 5,565-sign vocabulary.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Sign language recognition collapses on large vocabularies because models treat each sign as an indivisible visual pattern instead of reusable combinations of handshape, location, movement, and orientation. The paper introduces PHONSSM, a state space model that enforces phonological decomposition by applying anatomically grounded graph attention to skeleton sequences, factoring the resulting representations into orthogonal subspaces, and using prototypical classification. On the largest ASL dataset assembled so far, this approach reaches 72.1 percent top-1 accuracy on WLASL2000 using only skeleton data, an 18-point gain over prior skeleton methods and better than most RGB baselines. Gains are largest in the few-shot setting and the model transfers zero-shot to ASL Citizen, exceeding supervised RGB baselines. The central argument is that the scaling failure is a representation problem solved by building linguistic compositionality directly into the architecture.

Core claim

The vocabulary scaling bottleneck in sign language recognition is fundamentally a representation learning problem that is solved by compositional inductive biases mirroring the phonological structure of signs. By using anatomically-grounded graph attention on skeleton data to factor sequences into orthogonal subspaces corresponding to phonological parameters and then performing prototypical classification, state space models reach 72.1 percent accuracy on WLASL2000 and scale to a merged 5,565-sign vocabulary, with the largest gains in the few-shot regime (+225 percent relative) and zero-shot transfer to ASL Citizen.

What carries the argument

PHONSSM, which applies anatomically-grounded graph attention to skeleton sequences, explicitly factors the representations into orthogonal subspaces for phonological parameters, and performs prototypical classification within a state space model.
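To make the final step concrete, the following is a minimal sketch of prototypical classification over concatenated component embeddings. It is an illustration under stated assumptions, not the paper's hierarchical prototype classifier: the class-mean prototypes, the Euclidean distance, the synthetic data, and the names `build_prototypes` and `classify` are all hypothetical. The 4 × 32 layout assumes four branches with 32-dimensional embeddings, following the appendix fragment quoted in the reference graph.

```python
import numpy as np

def build_prototypes(embeddings, labels):
    """Return (classes, prototypes): one mean embedding per class."""
    classes = np.unique(labels)
    protos = np.stack([embeddings[labels == c].mean(axis=0) for c in classes])
    return classes, protos

def classify(queries, classes, protos):
    """Assign each query row to the class of its nearest prototype."""
    # Squared Euclidean distance between every query and every prototype.
    d2 = ((queries[:, None, :] - protos[None, :, :]) ** 2).sum(axis=-1)
    return classes[d2.argmin(axis=1)]

rng = np.random.default_rng(0)
# Hypothetical support set: 3 signs, 2 examples each, 4 branches x 32 dims.
centers = rng.normal(size=(3, 128)) * 5.0
support = np.repeat(centers, 2, axis=0) + rng.normal(size=(6, 128))
support_labels = np.repeat(np.arange(3), 2)

classes, protos = build_prototypes(support, support_labels)
queries = centers + rng.normal(size=(3, 128))  # one unseen example per sign
predicted = classify(queries, classes, protos)
```

Because a class is represented by the mean of however many examples exist, a sign with only two training clips still gets a usable prototype, which is the intuition behind the few-shot claim.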

If this is right

  • Large-vocabulary sign recognition becomes practical using only low-cost skeleton input instead of video.
  • Few-shot performance improves substantially because new signs can be composed from existing phonological subspaces.
  • Zero-shot transfer across datasets becomes possible by relying on shared phonological structure rather than dataset-specific visual patterns.
  • The same compositional bias should reduce the need for massive labeled video data in other structured sequence recognition tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The result suggests that explicit factorization of compositional rules may help other domains where flat representations fail to scale, such as gesture or action recognition.
  • Sufficiency of skeleton data implies that fine-grained appearance details are secondary to motion structure for distinguishing signs.
  • Extending the orthogonal subspace approach to continuous signing or to other sign languages could test whether the phonological decomposition generalizes beyond isolated ASL signs.

Load-bearing premise

The phonological parameters of signs can be captured as orthogonal subspaces in the model's representation space through graph attention on skeleton joints.
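The appendix fragments quoted in the reference graph describe an orthogonality penalty built from squared cosine similarities between component vectors, which is zero exactly when all pairs are orthogonal. A minimal sketch follows, assuming the penalty is summed over unordered pairs; the function name `ortho_loss` and the toy inputs are hypothetical.

```python
import numpy as np

def ortho_loss(components):
    """Sum of squared cosine similarities over unordered pairs of rows."""
    unit = components / np.linalg.norm(components, axis=1, keepdims=True)
    cos = unit @ unit.T
    upper = np.triu_indices(len(components), k=1)  # each pair counted once
    return float((cos[upper] ** 2).sum())

orthogonal = np.eye(4)        # four mutually orthogonal vectors in R^4
collapsed = np.ones((4, 4))   # four identical vectors
loss_orthogonal = ortho_loss(orthogonal)
loss_collapsed = ortho_loss(collapsed)
```

With four components there are six pairs, so the penalty ranges from 0 (all pairs orthogonal) to 6 (all components parallel).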

What would settle it

A state space model without the phonological factorization or orthogonal subspace constraint achieves comparable accuracy on the 5,565-sign dataset, or the learned subspaces fail to correspond to handshape, location, movement, and orientation when examined.
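One way to examine whether a branch encodes its intended parameter is the linear probe outlined in the appendix fragments of the reference graph: time-average the branch embedding, fit a logistic-regression classifier, and report 5-fold cross-validated accuracy. The sketch below assumes synthetic data and uses a small NumPy softmax regression fitted by gradient descent in place of a library classifier, so that it stays self-contained; the names, learning rate, step count, and feature standardization are illustrative choices, not the authors' settings.

```python
import numpy as np

def fit_softmax(X, y, n_classes, lr=0.1, steps=300):
    """Multinomial logistic regression fitted by plain gradient descent."""
    W = np.zeros((X.shape[1], n_classes))
    b = np.zeros(n_classes)
    Y = np.eye(n_classes)[y]  # one-hot targets
    for _ in range(steps):
        logits = X @ W + b
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        P = np.exp(logits)
        P /= P.sum(axis=1, keepdims=True)
        G = (P - Y) / len(X)  # gradient of mean cross-entropy w.r.t. logits
        W -= lr * (X.T @ G)
        b -= lr * G.sum(axis=0)
    return W, b

def probe_accuracy(X, y, n_classes, folds=5, seed=0):
    """Mean held-out accuracy of the linear probe over `folds` splits."""
    order = np.random.default_rng(seed).permutation(len(X))
    scores = []
    for held_out in np.array_split(order, folds):
        train = np.setdiff1d(order, held_out)
        # Standardize with training statistics only.
        mu, sd = X[train].mean(axis=0), X[train].std(axis=0) + 1e-8
        W, b = fit_softmax((X[train] - mu) / sd, y[train], n_classes)
        pred = (((X[held_out] - mu) / sd) @ W + b).argmax(axis=1)
        scores.append((pred == y[held_out]).mean())
    return float(np.mean(scores))

rng = np.random.default_rng(1)
# Hypothetical data: 200 clips x 30 frames x 32 dims for one branch.
labels = rng.integers(0, 4, size=200)
class_means = rng.normal(size=(4, 32)) * 3.0
frames = class_means[labels][:, None, :] + rng.normal(size=(200, 30, 32))
time_averaged = frames.mean(axis=1)

accuracy = probe_accuracy(time_averaged, labels, n_classes=4)
# Control: features unrelated to the labels should sit near chance (0.25).
control = probe_accuracy(rng.normal(size=(200, 32)), labels, n_classes=4)
```

Running the same probe on every branch against every phonological target gives the row-by-column comparison the falsification test calls for: a branch that does not beat the other branches on its own target would count against the factorization claim.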

Figures

Figures reproduced from arXiv: 2604.08761 by Austin Jin, Bryan Cheng, Jasper Zhang.

Figure 1: PhonSSM architecture. Landmarks flow through four stages: (1) AGAN encodes skeletal structure via anatomically-informed graph attention; (2) PDM factorizes features into four orthogonal phonological components (handshape, location, movement, orientation); (3) BISSM models bidirectional temporal dynamics; (4) HPC classifies using hierarchical prototypes for few-shot generalization. Input is T × N × 3 where …
Figure 2: Main results. (a,b,d) WLASL evaluation using pose+hands input (75 landmarks). (c) Merged-5565 evaluation using dominant-hand input (21 landmarks), a separate model. Specifically: (a) WLASL-2000: 72.1% vs baselines. (b) Vocabulary scaling (separate models per split). (c) Few-shot accuracy by training samples; gains largest for rare signs. (d) Ablation: PDM removal causes largest drop (−11.9pp).
Figure 3: Component factorization. Cosine similarity matrix.

Phonological decomposition enables few-shot learning (Table 2): for signs with 1–5 samples, 13.27% vs 4.08% (+225%). The Merged-5565 model also transfers zero-shot to held-out ASL Citizen samples (64.1% on overlapping vocabulary), compared to the RGB-based baseline of 63.2% reported in Desai et al. (2023) which requires full supervision. 4.5 ANALYSIS Com…
Figure 4: Phonological analysis. (a) Minimal pair error rates by component type. (b) Per-component accuracy across datasets. (c) Error rate by training frequency; gains are largest for rare signs.

Original abstract

Sign language recognition suffers from catastrophic scaling failure: models achieving high accuracy on small vocabularies collapse at realistic sizes. Existing architectures treat signs as atomic visual patterns, learning flat representations that cannot exploit the compositional structure of sign languages-systematically organized from discrete phonological parameters (handshape, location, movement, orientation) reused across the vocabulary. We introduce PHONSSM, enforcing phonological decomposition through anatomically-grounded graph attention, explicit factorization into orthogonal subspaces, and prototypical classification enabling few-shot transfer. Using skeleton data alone on the largest ASL dataset ever assembled (5,565 signs), PHONSSM achieves 72.1% on WLASL2000 (+18.4pp over skeleton SOTA), surpassing most RGB methods without video input. Gains are most dramatic in the few-shot regime (+225% relative), and the model transfers zero-shot to ASL Citizen, exceeding supervised RGB baselines. The vocabulary scaling bottleneck is fundamentally a representation learning problem, solvable through compositional inductive biases mirroring linguistic structure.
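The "+225% relative" figure is consistent with the few-shot accuracies quoted in the text captured under Figure 3 (13.27% versus 4.08% for signs with 1–5 samples). The short check below assumes the gain is computed as a relative improvement over the baseline; the function name is hypothetical.

```python
def relative_gain_percent(new, baseline):
    """Relative improvement of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0

few_shot_gain = relative_gain_percent(13.27, 4.08)  # about 225 percent
absolute_gain_pp = 13.27 - 4.08                     # about 9.19 percentage points
```

The distinction matters when reading the headline numbers: a 225 percent relative gain here corresponds to roughly 9 percentage points in absolute terms, whereas the +18.4pp WLASL2000 figure is an absolute difference.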

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 3 minor

Summary. The paper claims that sign language recognition exhibits catastrophic scaling failure due to atomic visual pattern treatment in existing models, and introduces PHONSSM to enforce phonological compositionality via anatomically-grounded graph attention on skeleton data, explicit orthogonal subspace factorization, and prototypical classification. Using skeleton input alone on a newly assembled 5,565-sign ASL dataset, PHONSSM achieves 72.1% accuracy on WLASL2000 (+18.4pp over prior skeleton SOTA), surpasses most RGB methods, and shows strong few-shot (+225% relative) and zero-shot transfer to ASL Citizen, attributing success to compositional inductive biases mirroring linguistic structure.

Significance. If the results hold after addressing potential confounds, this would be a meaningful contribution to sign language recognition by demonstrating that skeleton-based models can scale to large vocabularies and compete with video methods through linguistic phonological structure. The assembly of the largest ASL dataset to date is a clear strength and community resource. The reported few-shot and zero-shot gains, if reproducible, would support practical utility in low-data regimes and highlight the value of compositional biases over flat representations.

major comments (1)
  1. [Abstract] The central claim attributes the +18.4pp gain on WLASL2000 and scaling success to the phonological mechanisms (anatomically-grounded graph attention, orthogonal subspace factorization, prototypical classification). However, no ablation is described that compares PHONSSM to a non-compositional baseline trained on the identical 5,565-sign dataset; prior skeleton SOTA used smaller subsets, so the performance delta may be driven by data volume rather than the inductive biases.
minor comments (3)
  1. The abstract provides no details on training procedure, exact baseline implementations and hyperparameters, error bars, statistical significance, or data splits, preventing verification of the numerical claims.
  2. The title emphasizes state space models, but the abstract does not clarify how SSM components integrate with the graph attention and subspace factorization; explicit description of this architecture would aid clarity.
  3. Additional information on the 5,565-sign dataset construction, including sources, collection/annotation methodology, and overlap with WLASL splits, would strengthen reproducibility.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and the opportunity to clarify the attribution of our results. We address the major comment point by point below.

Point-by-point responses
  1. Referee: [Abstract] The central claim attributes the +18.4pp gain on WLASL2000 and scaling success to the phonological mechanisms (anatomically-grounded graph attention, orthogonal subspace factorization, prototypical classification). However, no ablation is described that compares PHONSSM to a non-compositional baseline trained on the identical 5,565-sign dataset; prior skeleton SOTA used smaller subsets, so the performance delta may be driven by data volume rather than the inductive biases.

    Authors: We agree that the abstract does not present a direct ablation of PHONSSM against a non-compositional baseline trained on the full 5,565-sign dataset, and that this leaves open the possibility that data scale contributes to the observed gains relative to prior skeleton methods. Our reported comparisons follow the standard WLASL2000 protocol against the strongest published skeleton baselines, which were trained on smaller subsets; the new dataset is a core contribution that enables vocabulary-scale evaluation. To isolate the role of the phonological inductive biases, we will add an ablation in the revised manuscript that trains a plain state-space-model baseline (without anatomically-grounded graph attention, orthogonal subspace factorization, or prototypical classification) on the identical 5,565-sign training set and reports its accuracy on WLASL2000. This will allow readers to quantify how much of the +18.4pp improvement is attributable to the compositional mechanisms versus data volume alone.

    Revision: yes

Circularity Check

0 steps flagged

No circularity in derivation chain; claims are empirical

Full rationale

The paper introduces PHONSSM as an architecture with anatomically-grounded graph attention, subspace factorization, and prototypical classification, then reports empirical accuracies (72.1% on WLASL2000) and relative gains on a newly assembled 5,565-sign dataset. No equations, first-principles derivations, or predictions are presented that reduce the reported metrics to quantities defined by the model itself or to fitted parameters. The phonological inductive biases are architectural choices validated through standard train/test comparisons rather than self-referential loops. No self-citation chains or uniqueness theorems are invoked as load-bearing premises. The derivation chain is therefore self-contained empirical evaluation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim depends on the domain assumption that sign language phonology provides reusable discrete parameters that can be isolated via graph attention and orthogonal factorization; no free parameters or new physical entities are introduced beyond the model itself.

axioms (1)
  • domain assumption: Sign language signs are composed from a small set of reusable phonological parameters (handshape, location, movement, orientation) that can be modeled as orthogonal subspaces.
    Invoked to justify the explicit factorization and graph attention design.
invented entities (1)
  • PHONSSM (no independent evidence)
    purpose: State space model architecture that enforces phonological decomposition for sign language recognition.
    New model name and design introduced to address scaling failure.

pith-pipeline@v0.9.0 · 5471 in / 1450 out tokens · 49168 ms · 2026-05-10T17:08:54.162896+00:00 · methodology

Reference graph

Works this paper leans on

13 extracted references · 13 canonical work pages

  1. [1]

    For $D_c \ge 4$, the global minimum $L_{\text{ortho}} = 0$ is achievable

  2. [2]

    (1) By Lemma B.6, each term $\cos^2(c^{(i)}, c^{(j)}) \ge 0$

    The gradient with respect to $c^{(k)}$ is:

    $$\nabla_{c^{(k)}} L_{\text{ortho}} = \sum_{j \neq k} \frac{2\cos(c^{(k)}, c^{(j)})}{\|c^{(k)}\|_2 \, \|c^{(j)}\|} \left( c^{(j)} - \cos(c^{(k)}, c^{(j)}) \, \frac{\|c^{(j)}\|}{\|c^{(k)}\|} \, c^{(k)} \right) \tag{6}$$

    Proof. (1) By Lemma B.6, each term $\cos^2(c^{(i)}, c^{(j)}) \ge 0$. The sum equals zero iff every term equals zero, i.e., iff all pairs are orthogonal. (2) In $\mathbb{R}^{D_c}$ with $D_c \ge 4$, we can always find four mutually orthogonal vectors (e....

  3. [3]

    Centered by subtracting wrist position

  4. [4]

    Normalized to unit scale based on palm size

  5. [5]

    Resampled to 30 frames using linear interpolation

  6. [6]

    mother”/“father

    Augmented during training with random temporal shifts (±3 frames) and scale jitter (±10%)

    E ADDITIONAL RESULTS
    E.1 PER-CLASS ANALYSIS

    Table 6: Performance breakdown by phonological characteristics on WLASL100. ∆: improvement over Bi-LSTM.

    Category          | # Signs | Bi-LSTM | PHONSSM | ∆
    One-handed signs  | 62      | 71.2    | 89.4    | +18.2
    Two-handed signs  | 38      | 68.9    | 86.8    | +17.9
    Static (no movemen...

  7. [7]

    Extract the time-averaged component embedding $\bar{c}^{(i)} \in \mathbb{R}^{32}$

  8. [8]

    Train a linear classifier (logistic regression) to predict each phonological category

  9. [9]

    Each row shows one PDM branch; each column shows one phonological prediction task

    Report accuracy on held-out samples (5-fold cross-validation)

    G.3 FULL RESULTS

    Table 11: Complete linear probe results. Each row shows one PDM branch; each column shows one phonological prediction task. Diagonal entries (bold) indicate intended correspondences.

    PDM Branch | Handshape | Location | Movement | Orientation   (columns: prediction target)
    Handshape  | 78.4      | 31.2     | 24.8     | 38.1
    ...

  10. [10]

    Specialization: Each branch achieves highest accuracy on its intended category (diagonal), confirming semantic correspondence

  11. [11]

    This is expected since components are correlated in natural signs (e.g., certain handshapes occur more often at certain locations)

    Above-chance cross-prediction: Off-diagonal entries exceed chance, indicating some phonological information leaks across branches. This is expected since components are correlated in natural signs (e.g., certain handshapes occur more often at certain locations)

  12. [12]

    Factorization benefit: The gap between diagonal and off-diagonal (e.g., 78.4 vs 31.2 for handshape) demonstrates effective factorization

  13. [13]

    Full embedding

    Factorization-accuracy trade-off: The “Full embedding” row shows slightly higher accuracy than individual branches (e.g., 81.2 vs 78.4 for handshape), indicating ∼3pp is sacrificed for factorization. We experimented with relaxing $\lambda_{\text{ortho}}$ from 0.1 to 0.05: component probe accuracy improved by ∼2pp but sign-level accuracy dropped by 1.5pp due to increased r...
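The preprocessing and augmentation steps listed in entries 3 to 6 above (wrist centering, palm-size normalization, resampling to 30 frames by linear interpolation, and temporal-shift plus scale-jitter augmentation) can be sketched as follows. This is an illustration under assumptions rather than the authors' pipeline: the landmark indices (wrist at 0, base of the middle finger at 9), the definition of palm size as the mean wrist-to-middle-base distance, and the edge-repeating temporal shift are all hypothetical choices.

```python
import numpy as np

WRIST, MIDDLE_BASE = 0, 9  # assumed landmark indices (hypothetical layout)

def preprocess(seq, n_frames=30):
    """seq: (T, N, 3) landmarks -> (n_frames, N, 3), centered, scaled, resampled."""
    seq = seq - seq[:, WRIST:WRIST + 1, :]  # wrist at the origin in every frame
    # Assumed palm size: mean wrist-to-middle-base distance over the clip.
    palm = np.linalg.norm(seq[:, MIDDLE_BASE, :], axis=-1).mean()
    seq = seq / max(palm, 1e-8)
    # Linear interpolation of every coordinate onto n_frames time points.
    src = np.linspace(0.0, 1.0, len(seq))
    dst = np.linspace(0.0, 1.0, n_frames)
    flat = seq.reshape(len(seq), -1)
    out = np.stack(
        [np.interp(dst, src, flat[:, k]) for k in range(flat.shape[1])], axis=1
    )
    return out.reshape(n_frames, *seq.shape[1:])

def augment(seq, rng, max_shift=3, jitter=0.10):
    """Random temporal shift (±max_shift frames) and scale jitter (±jitter)."""
    shift = int(rng.integers(-max_shift, max_shift + 1))
    # Shift frame indices and clamp, so edge frames repeat instead of wrapping.
    idx = np.clip(np.arange(len(seq)) + shift, 0, len(seq) - 1)
    scale = 1.0 + rng.uniform(-jitter, jitter)
    return seq[idx] * scale

rng = np.random.default_rng(0)
raw = rng.normal(size=(47, 21, 3))  # hypothetical clip: 47 frames, 21 landmarks
clean = preprocess(raw)
augmented = augment(clean, rng)
```

Centering and scaling before resampling keeps the interpolation in a signer-independent coordinate frame, and applying augmentation only after preprocessing means the jitter is expressed relative to unit palm size.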