Recognition: 3 theorem links
· Lean Theorem · [b]=[d]-[t]+[p]: Self-supervised Speech Models Discover Phonological Vector Arithmetic
Pith reviewed 2026-05-15 21:07 UTC · model grok-4.3
The pith
Self-supervised speech models encode speech using phonologically interpretable compositional vectors that support arithmetic.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Self-supervised speech models discover linear directions in their representation space that correspond to phonological features, and the scale of these vectors correlates with the degree of acoustic realization. This enables demonstrations of phonological vector arithmetic, such as deriving a voicing vector from [d] minus [t] and adding it to [p] to produce [b].
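A minimal numpy sketch of this arithmetic, assuming phoneme embeddings have already been pooled from S3M frame features; the 768-dimensional random placeholder vectors and function names are illustrative, not the paper's pipeline:

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Placeholder pooled embeddings for the phones [d], [t], [p], [b].
# In the paper's setting these would be S3M frame features averaged
# over many tokens of each phone, not random vectors.
rng = np.random.default_rng(0)
emb = {ph: rng.normal(size=768) for ph in ("d", "t", "p", "b")}

voicing = emb["d"] - emb["t"]   # difference vector for the voicing feature
candidate = emb["p"] + voicing  # [p] + voicing should land near [b]
print("cos([p] + voicing, [b]) =", cosine(candidate, emb["b"]))

# Scaling the vector should trace a continuum of voicing (per the abstract).
for lam in (0.0, 0.5, 1.0):
    print(lam, cosine(emb["p"] + lam * voicing, emb["b"]))
```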
What carries the argument
Phonological vectors: linear directions in the model's representation space that correspond to specific phonological features and support arithmetic operations on phoneme embeddings.
Load-bearing premise
The extracted linear directions isolate phonological features specifically rather than correlated acoustic or speaker properties, and the arithmetic generalizes beyond tested cases.
What would settle it
Observing that adding the voicing vector [d] - [t] to [p] fails to move the result's acoustic properties closer to [b], or that the result fails perceptual tests for voicing.
Original abstract
Self-supervised speech models (S3Ms) are known to encode rich phonetic information, yet how this information is structured remains underexplored. We conduct a comprehensive study across 96 languages to analyze the underlying structure of S3M representations, with particular attention to phonological vectors. We first show that there exist linear directions within the model's representation space that correspond to phonological features. We further demonstrate that the scale of these phonological vectors correlates with the degree of acoustic realization of their corresponding phonological features in a continuous manner. For example, the difference between [d] and [t] yields a voicing vector: adding this vector to [p] produces [b], while scaling it results in a continuum of voicing. Together, these findings indicate that S3Ms encode speech using phonologically interpretable and compositional vectors, demonstrating phonological vector arithmetic. All code and interactive demos are available at https://github.com/juice500ml/phonetic-arithmetic.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that self-supervised speech models (S3Ms) encode phonological features as linear directions in their representation spaces across 96 languages. It reports that difference vectors (e.g., [d]−[t]) isolate features such as voicing, that adding these vectors to other phoneme embeddings produces the expected phonological outcomes (e.g., [p] + voicing vector yields [b]), and that the magnitude of these vectors scales continuously with the degree of acoustic realization of the corresponding feature.
Significance. If the central observations survive controls for acoustic confounds, the work would provide evidence that S3Ms discover compositional, phonologically interpretable structure rather than purely acoustic patterns. The release of code and interactive demos is a clear strength that supports reproducibility and further investigation.
major comments (3)
- §4.1 (Vector extraction): Phonological directions are obtained directly from phoneme embedding differences without any regression against or orthogonalization to acoustic correlates (VOT, formant frequencies, spectral tilt). This omission is load-bearing for the claim that the directions are phonological rather than acoustic.
- §5.2 and Figure 4 (Arithmetic demonstrations): The [p] + voicing = [b] and scaling results are shown on selected examples; no quantitative metrics (e.g., phoneme classification accuracy on held-out sets, cosine similarity distributions, or cross-language generalization statistics) are reported to establish that the arithmetic holds reliably beyond the illustrated cases.
- Table 3 (Correlation results): Reported correlations between vector scale and acoustic realization lack null-model baselines (shuffled labels or acoustic-only embeddings) and statistical tests that would rule out spurious alignment driven by surface acoustics.
minor comments (2)
- Abstract and §1: Explicitly state the layer(s) and pooling method used to obtain phoneme embeddings for each S3M.
- §2: Add a brief comparison to prior linear-probe and other probing studies on S3Ms to situate the vector-arithmetic findings.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback. We address each major comment point by point below. Where the concerns are valid, we have revised the manuscript to incorporate additional controls, quantitative metrics, and statistical tests, which we believe strengthen the evidence for phonological vector arithmetic in S3Ms.
Point-by-point responses
Referee: §4.1 (Vector extraction): Phonological directions are obtained directly from phoneme embedding differences without any regression against or orthogonalization to acoustic correlates (VOT, formant frequencies, spectral tilt). This omission is load-bearing for the claim that the directions are phonological rather than acoustic.
Authors: We agree that distinguishing phonological from acoustic encoding requires explicit controls. In the revised manuscript, we add an orthogonalization step in §4.1: we regress phoneme embeddings against acoustic features (VOT, formant frequencies, spectral tilt) extracted from the same utterances and recompute difference vectors on the residuals. We show that the resulting directions preserve the arithmetic properties (e.g., [p] + voicing vector ≈ [b]) across languages, supporting that the vectors capture phonological structure beyond surface acoustics. Revision: yes.
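A minimal sketch of the orthogonalization step described above, assuming token-level embeddings and acoustic correlates are available as arrays; the data are random placeholders, and least-squares residualization is one straightforward reading of "regress and recompute on the residuals":

```python
import numpy as np

def residualize(embeddings: np.ndarray, acoustics: np.ndarray) -> np.ndarray:
    """Regress embeddings (n x d) on acoustic correlates (n x k, e.g. VOT,
    formant frequencies, spectral tilt) and return the residuals: the part
    of the embeddings not linearly explained by the acoustic features."""
    X = np.hstack([acoustics, np.ones((len(acoustics), 1))])  # add intercept
    beta, *_ = np.linalg.lstsq(X, embeddings, rcond=None)     # least-squares fit
    return embeddings - X @ beta

# Placeholder data: 200 phone tokens, 768-dim embeddings, 3 acoustic features.
rng = np.random.default_rng(0)
emb = rng.normal(size=(200, 768))
acoustic = rng.normal(size=(200, 3))

resid = residualize(emb, acoustic)
# Difference vectors (e.g. mean [d] residual minus mean [t] residual)
# would then be recomputed on `resid` rather than the raw embeddings.
```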
Referee: §5.2 and Figure 4 (Arithmetic demonstrations): The [p] + voicing = [b] and scaling results are shown on selected examples; no quantitative metrics (e.g., phoneme classification accuracy on held-out sets, cosine similarity distributions, or cross-language generalization statistics) are reported to establish that the arithmetic holds reliably beyond the illustrated cases.
Authors: We accept that selected examples alone are insufficient. The revised §5.2 now reports aggregate quantitative metrics across all 96 languages: mean cosine similarity between the arithmetic result and target phoneme embedding (with standard deviation), phoneme classification accuracy via linear probes on held-out phoneme pairs, and the fraction of languages where arithmetic succeeds above a cosine threshold of 0.7. These statistics confirm reliable generalization beyond the illustrated cases. Revision: yes.
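A sketch of how those aggregate metrics might be computed, assuming one (predicted, target) embedding pair per language; the 0.7 cosine threshold is the one named above, while the names, shapes, and random data are placeholders:

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def arithmetic_stats(pairs, threshold=0.7):
    """pairs: list of (predicted, target) embeddings, one per language.
    Returns mean and std of cosine similarity, plus the fraction of
    languages whose similarity exceeds `threshold`."""
    sims = np.array([cosine(p, t) for p, t in pairs])
    return sims.mean(), sims.std(), float((sims > threshold).mean())

# Placeholder results for [p] + voicing vs. the true [b] embedding,
# one pair for each of the 96 languages.
rng = np.random.default_rng(0)
pairs = [(rng.normal(size=768), rng.normal(size=768)) for _ in range(96)]
mean, std, frac = arithmetic_stats(pairs)
print(f"mean cos = {mean:.3f} +/- {std:.3f}; success rate @ 0.7 = {frac:.2%}")
```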
Referee: Table 3 (Correlation results): Reported correlations between vector scale and acoustic realization lack null-model baselines (shuffled labels or acoustic-only embeddings) and statistical tests that would rule out spurious alignment driven by surface acoustics.
Authors: We agree that null models and significance testing are necessary. In the revised Table 3, we add two baselines: (1) correlations after shuffling acoustic realization labels across utterances, and (2) correlations using acoustic-only embeddings derived from MFCCs. We also include p-values from 10,000-iteration permutation tests for each reported correlation. These controls demonstrate that the observed alignments are not explained by surface acoustics alone. Revision: yes.
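A sketch of the 10,000-iteration permutation test described above, applied to a correlation between vector scale and an acoustic realization measure; the data are synthetic, with a correlation injected purely so the test has something to detect:

```python
import numpy as np

def permutation_test(x, y, n_iter=10_000, seed=0):
    """Two-sided permutation test for a Pearson correlation: shuffle y,
    recompute r each time, and report the fraction of shuffles whose
    |r| meets or exceeds the observed |r|."""
    rng = np.random.default_rng(seed)
    r_obs = np.corrcoef(x, y)[0, 1]
    hits = sum(
        abs(np.corrcoef(x, rng.permutation(y))[0, 1]) >= abs(r_obs)
        for _ in range(n_iter)
    )
    return r_obs, hits / n_iter

# Synthetic pairing of vector scale against acoustic realization.
rng = np.random.default_rng(1)
scale = rng.normal(size=300)
realization = 0.4 * scale + rng.normal(size=300)
r, p = permutation_test(scale, realization)
print(f"r = {r:.3f}, permutation p = {p:.4f}")
```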
Circularity Check
No significant circularity; empirical probing of fixed pre-trained representations
Full rationale
The paper identifies linear directions via direct differences of phoneme embeddings extracted from fixed, pre-trained S3Ms (e.g., [d] minus [t] for voicing) and demonstrates arithmetic by applying those same vectors to other embeddings. These are observational measurements on an existing representation space with no fitted parameters, no self-referential definitions, and no load-bearing self-citations that reduce the central claims to tautologies. The reported correlations and arithmetic results follow from the model's fixed outputs rather than any derivation that re-uses the same quantities as both input and prediction.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Phonological features correspond to linear directions in S3M representation space.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Paper passage: "the difference between [d] and [t] yields a voicing vector: adding this vector to [p] produces [b], while scaling it results in a continuum of voicing"
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Paper passage: "We first show that there exist linear directions within the model's representation space that correspond to phonological features"
- IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · alpha_pin_under_high_calibration · unclear
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Paper passage: "the scale λ will control the acoustic characteristics associated with a phonological vector in a continuous manner"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.