arxiv: 2602.18899 · v3 · submitted 2026-02-21 · 📡 eess.AS · cs.CL· cs.LG· cs.SD

Recognition: 3 theorem links

· Lean Theorem

[b]=[d]-[t]+[p]: Self-supervised Speech Models Discover Phonological Vector Arithmetic

Kwanghee Choi , Eunjung Yeo , Cheol Jun Cho , David Harwath , David R. Mortensen

Authors on Pith no claims yet

Pith reviewed 2026-05-15 21:07 UTC · model grok-4.3

classification 📡 eess.AS cs.CLcs.LGcs.SD

keywords self-supervised speech modelsphonological featuresvector arithmeticphonetic representationscompositional vectorsspeech embeddingsmultilingual analysis

0 comments

The pith

Self-supervised speech models encode speech using phonologically interpretable compositional vectors that support arithmetic.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Self-supervised speech models contain linear directions in their representation space that align with phonological features. The difference between sounds like [d] and [t] produces a vector for voicing that can be added to [p] to create [b] or scaled to vary the feature strength continuously. This demonstrates that the models use structured, compositional representations for speech sounds rather than unstructured encodings. Readers should care because it provides a way to interpret and manipulate the internal knowledge of these widely used models in a linguistically meaningful manner. The study covers representations from models trained on 96 languages to support the generality of the finding.

Core claim

Self-supervised speech models discover linear directions in their representation space that correspond to phonological features, with the scale of these vectors correlating to the degree of acoustic realization, enabling demonstrations of phonological vector arithmetic such as deriving a voicing vector from [d] minus [t] and adding it to [p] to produce [b].

What carries the argument

Phonological vectors, which are linear directions in the model's representation space corresponding to specific phonological features and enabling arithmetic operations on them.

Load-bearing premise

The extracted linear directions isolate phonological features specifically rather than correlated acoustic or speaker properties, and the arithmetic generalizes beyond tested cases.

What would settle it

Observing that the voicing vector from [d] - [t] added to [p] does not result in acoustic properties closer to [b] or fails perceptual tests for voicing.

read the original abstract

Self-supervised speech models (S3Ms) are known to encode rich phonetic information, yet how this information is structured remains underexplored. We conduct a comprehensive study across 96 languages to analyze the underlying structure of S3M representations, with particular attention to phonological vectors. We first show that there exist linear directions within the model's representation space that correspond to phonological features. We further demonstrate that the scale of these phonological vectors correlate to the degree of acoustic realization of their corresponding phonological features in a continuous manner. For example, the difference between [d] and [t] yields a voicing vector: adding this vector to [p] produces [b], while scaling it results in a continuum of voicing. Together, these findings indicate that S3Ms encode speech using phonologically interpretable and compositional vectors, demonstrating phonological vector arithmetic. All code and interactive demos are available at https://github.com/juice500ml/phonetic-arithmetic .

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

S3Ms show linear phonological directions supporting vector arithmetic across 96 languages, but acoustic confounds like VOT need explicit controls.

read the letter

The main point is that the authors identify linear directions in self-supervised speech model embeddings tied to phonological features such as voicing, with examples like subtracting [t] from [d] to get a vector that turns [p] into [b] when added, and scaling that vector produces a voicing continuum. They report this pattern holding across 96 languages with code and demos released for inspection. That broad empirical sweep and the concrete arithmetic demonstrations are the solid parts here, extending embedding arithmetic ideas to phonology in a systematic way without relying on heavy theory. The continuous scaling result adds a useful angle beyond binary switches. The soft spot is the lack of detail on ruling out acoustic correlates. The abstract links the vectors to acoustic realization but does not describe regressions against measures like voice onset time or formants, or any orthogonalization steps, so it remains possible the directions track surface acoustics rather than abstract phonological structure. If the full methods include those checks and statistical tests for robustness, the claim strengthens; otherwise the interpretation stays tentative. This work fits readers in speech model interpretability and multilingual phonetics who want to probe how representations organize phonetic content. It is worth a serious referee because the observation is new enough and the language coverage is substantial, even if the controls need tightening to pin down what the vectors actually isolate.

Referee Report

3 major / 2 minor

Summary. The paper claims that self-supervised speech models (S3Ms) encode phonological features as linear directions in their representation spaces across 96 languages. It reports that difference vectors (e.g., [d]−[t]) isolate features such as voicing, that adding these vectors to other phoneme embeddings produces the expected phonological outcomes (e.g., [p] + voicing vector yields [b]), and that the magnitude of these vectors scales continuously with the degree of acoustic realization of the corresponding feature.

Significance. If the central observations survive controls for acoustic confounds, the work would provide evidence that S3Ms discover compositional, phonologically interpretable structure rather than purely acoustic patterns. The release of code and interactive demos is a clear strength that supports reproducibility and further investigation.

major comments (3)

[§4.1] §4.1 (Vector extraction): Phonological directions are obtained directly from phoneme embedding differences without any regression against or orthogonalization to acoustic correlates (VOT, formant frequencies, spectral tilt). This omission is load-bearing for the claim that the directions are phonological rather than acoustic.
[§5.2] §5.2 and Figure 4 (Arithmetic demonstrations): The [p] + voicing = [b] and scaling results are shown on selected examples; no quantitative metrics (e.g., phoneme classification accuracy on held-out sets, cosine similarity distributions, or cross-language generalization statistics) are reported to establish that the arithmetic holds reliably beyond the illustrated cases.
[Table 3] Table 3 (Correlation results): Reported correlations between vector scale and acoustic realization lack null-model baselines (shuffled labels or acoustic-only embeddings) and statistical tests that would rule out spurious alignment driven by surface acoustics.

minor comments (2)

[Abstract] The abstract and §1 should explicitly state the layer(s) and pooling method used to obtain phoneme embeddings for each S3M.
[§2] Add a brief comparison to prior linear-probe and probing studies on S3Ms to situate the vector-arithmetic findings.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment point by point below. Where the concerns are valid, we have revised the manuscript to incorporate additional controls, quantitative metrics, and statistical tests, which we believe strengthen the evidence for phonological vector arithmetic in S3Ms.

read point-by-point responses

Referee: [§4.1] §4.1 (Vector extraction): Phonological directions are obtained directly from phoneme embedding differences without any regression against or orthogonalization to acoustic correlates (VOT, formant frequencies, spectral tilt). This omission is load-bearing for the claim that the directions are phonological rather than acoustic.

Authors: We agree that distinguishing phonological from acoustic encoding requires explicit controls. In the revised manuscript, we add an orthogonalization step in §4.1: we regress phoneme embeddings against acoustic features (VOT, formant frequencies, spectral tilt) extracted from the same utterances and recompute difference vectors on the residuals. We show that the resulting directions preserve the arithmetic properties (e.g., [p] + voicing vector ≈ [b]) across languages, supporting that the vectors capture phonological structure beyond surface acoustics. revision: yes
Referee: [§5.2] §5.2 and Figure 4 (Arithmetic demonstrations): The [p] + voicing = [b] and scaling results are shown on selected examples; no quantitative metrics (e.g., phoneme classification accuracy on held-out sets, cosine similarity distributions, or cross-language generalization statistics) are reported to establish that the arithmetic holds reliably beyond the illustrated cases.

Authors: We accept that selected examples alone are insufficient. The revised §5.2 now reports aggregate quantitative metrics across all 96 languages: mean cosine similarity between the arithmetic result and target phoneme embedding (with standard deviation), phoneme classification accuracy via linear probes on held-out phoneme pairs, and the fraction of languages where arithmetic succeeds above a cosine threshold of 0.7. These statistics confirm reliable generalization beyond the illustrated cases. revision: yes
Referee: [Table 3] Table 3 (Correlation results): Reported correlations between vector scale and acoustic realization lack null-model baselines (shuffled labels or acoustic-only embeddings) and statistical tests that would rule out spurious alignment driven by surface acoustics.

Authors: We agree that null models and significance testing are necessary. In the revised Table 3, we add two baselines: (1) correlations after shuffling acoustic realization labels across utterances, and (2) correlations using acoustic-only embeddings derived from MFCCs. We also include p-values from 10,000-iteration permutation tests for each reported correlation. These controls demonstrate that the observed alignments are not explained by surface acoustics alone. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical probing of fixed pre-trained representations

full rationale

The paper identifies linear directions via direct differences of phoneme embeddings extracted from fixed, pre-trained S3Ms (e.g., [d] minus [t] for voicing) and demonstrates arithmetic by applying those same vectors to other embeddings. These are observational measurements on an existing representation space with no fitted parameters, no self-referential definitions, and no load-bearing self-citations that reduce the central claims to tautologies. The reported correlations and arithmetic results follow from the model's fixed outputs rather than any derivation that re-uses the same quantities as both input and prediction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The work rests on the domain assumption that phonological features manifest as approximately linear directions in the representation space of S3Ms; no free parameters or invented entities are described in the abstract.

axioms (1)

domain assumption Phonological features correspond to linear directions in S3M representation space
Invoked when identifying vectors via subtraction and addition operations.

pith-pipeline@v0.9.0 · 5492 in / 1205 out tokens · 19916 ms · 2026-05-15T21:07:03.133078+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

the difference between [d] and [t] yields a voicing vector: adding this vector to [p] produces [b], while scaling it results in a continuum of voicing
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We first show that there exist linear directions within the model's representation space that correspond to phonological features
IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean alpha_pin_under_high_calibration unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

the scale λ will control the acoustic characteristics associated with a phonological vector in a continuous manner

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.