
arxiv: 2606.19398 · v1 · pith:QSSPSIEGnew · submitted 2026-06-17 · 💻 cs.SD · eess.AS · eess.SP

S-JEPA: Soft Clustering Anchors for Self-Supervised Speech Representation Learning

Pith reviewed 2026-06-26 19:44 UTC · model grok-4.3

classification 💻 cs.SD · eess.AS · eess.SP
keywords self-supervised speech learning · soft clustering · Gaussian mixture model · JEPA · SUPERB benchmark · word error rate · acoustic ambiguity · continuous training

The pith

S-JEPA trains speech encoders by matching soft GMM posteriors instead of hard cluster labels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces S-JEPA to train self-supervised speech encoders by predicting soft probabilities from a Gaussian mixture model at masked positions rather than discrete cluster IDs. This change allows training to run as one continuous process in two phases without stopping to re-cluster the full dataset or using a separate teacher model. An adaptive choice of which layer to cluster on comes from a label-free signal. If the approach holds, speech models could handle uncertainty at sound boundaries more faithfully while using fewer parameters on standard benchmarks.

Core claim

S-JEPA is a JEPA-style encoder-predictor pair trained to match the soft posteriors of a Gaussian Mixture Model at masked positions via KL divergence. Training runs continuously in two phases: a fixed GMM over MFCC features, then an online GMM over encoder features, with the input layer selected adaptively from a label-free signal. This removes both the offline re-cluster step and the hand-tuned choice of which transformer layer to cluster on. Under the SUPERB protocol, S-JEPA achieves the lowest WER among evaluated SSL methods below 90M parameters and matches HuBERT-Base on emotion recognition at roughly half its parameter count.
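The training signal described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names `gmm_posteriors` and `masked_kl_loss` are hypothetical, the GMM is assumed to have diagonal covariances, and the KL direction (target posterior against predictor output) is an assumption, since the page does not specify it.

```python
import numpy as np

def log_softmax(x, axis=-1):
    # Numerically stable log-softmax.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def gmm_posteriors(feats, means, variances, weights):
    """Soft responsibilities p(k | x_t) of a diagonal-covariance GMM (assumed form).

    feats: (T, D) frames; means, variances: (K, D); weights: (K,).
    """
    diff = feats[:, None, :] - means[None, :, :]                     # (T, K, D)
    log_pdf = -0.5 * (np.log(2 * np.pi * variances)[None]
                      + diff ** 2 / variances[None]).sum(axis=-1)    # (T, K)
    return np.exp(log_softmax(np.log(weights)[None] + log_pdf))      # rows sum to 1

def masked_kl_loss(target_post, pred_logits, mask, eps=1e-8):
    """Mean KL(target || predictor) over masked frames only (assumed direction)."""
    log_q = log_softmax(pred_logits)
    log_p = np.log(target_post + eps)
    kl = (target_post * (log_p - log_q)).sum(axis=-1)                # (T,)
    return kl[mask].mean()

# Toy usage with random stand-ins for MFCC frames and predictor outputs.
rng = np.random.default_rng(0)
T, D, K = 50, 13, 8
feats = rng.normal(size=(T, D))
post = gmm_posteriors(feats, rng.normal(size=(K, D)),
                      np.ones((K, D)), np.full(K, 1.0 / K))
mask = np.zeros(T, dtype=bool)
mask[10:30] = True                                                   # masked span
loss = masked_kl_loss(post, rng.normal(size=(T, K)), mask)
```

Only the masked frames contribute to the loss, which mirrors the statement that the single training signal is KL divergence at masked positions.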

What carries the argument

Soft posterior matching to GMM anchors via KL divergence in a two-phase continuous training process with adaptive layer selection.

If this is right

  • Lowest WER among evaluated SSL methods below 90M parameters on SUPERB.
  • Matches HuBERT-Base on emotion recognition at roughly half the parameter count.
  • Predictor per-frame entropy on held-out speech shows a bimodal distribution, with a substantial minority of frames near the entropy of a perfect two-cluster tie.
  • Enables single continuous optimization trajectory without offline re-clustering or teacher distillation.
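To make the entropy reference point in the list above concrete, the sketch below computes per-frame Shannon entropy in bits. It assumes entropy is measured in bits over the cluster posterior (consistent with the figure captions, which quote bits); the function name `frame_entropy_bits` is hypothetical.

```python
import numpy as np

def frame_entropy_bits(post, eps=1e-12):
    """Shannon entropy (bits) of each row of a (T, K) posterior matrix."""
    p = np.clip(post, eps, 1.0)          # avoid log2(0); zero entries contribute 0
    return -(post * np.log2(p)).sum(axis=-1)

K = 4
confident = np.array([1.0, 0.0, 0.0, 0.0])   # one cluster takes all the mass
tie = np.array([0.5, 0.5, 0.0, 0.0])         # perfect two-cluster tie
uniform = np.full(K, 1.0 / K)                # maximally diffuse over K clusters

H = frame_entropy_bits(np.stack([confident, tie, uniform]))
# H is approximately [0, 1, 2] bits
```

A perfect two-cluster tie sits at exactly 1 bit regardless of the number of components, which is why it is a natural landmark for a secondary mode in the entropy histogram.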

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The method could apply to other sequence domains where category boundaries carry real uncertainty, such as audio events or prosody labeling.
  • Removing periodic re-clustering steps would lower the total compute needed to reach a given performance level in large-scale pretraining runs.
  • Explicit modeling of per-frame ambiguity might improve calibration on downstream tasks that involve noisy or accented speech.

Load-bearing premise

The two-phase GMM with adaptive layer selection produces stable and representative soft posteriors that better capture acoustic ambiguity than hard clustering across diverse speech data.

What would settle it

An ablation that applies the identical two-phase continuous schedule and adaptive layer selection but replaces soft posteriors with hard cluster assignments, then measures whether WER and emotion recognition scores stay competitive.
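The relationship between the soft objective and its hard-label ablation can be written out directly. The derivation below assumes the loss is $\mathrm{KL}(p_t \,\|\, q_t)$ with $p_t$ the GMM posterior and $q_t$ the predictor output at masked frame $t$; the direction and the symbols are assumptions for illustration, not notation taken from the paper.

```latex
\mathrm{KL}(p_t \,\|\, q_t)
  = \sum_{k=1}^{K} p_{t,k} \log \frac{p_{t,k}}{q_{t,k}}
  = \underbrace{-\sum_{k=1}^{K} p_{t,k} \log q_{t,k}}_{\text{cross-entropy}}
    \; - \; H(p_t),
\qquad
H(p_t) = -\sum_{k=1}^{K} p_{t,k} \log p_{t,k}.

% Hard-assignment ablation: replace p_t by the one-hot vector e_{k^*},
% with k^* = \arg\max_k p_{t,k}. Then H(e_{k^*}) = 0 and
\mathrm{KL}(e_{k^*} \,\|\, q_t) = -\log q_{t,k^*}.
```

Under these assumptions the hard-label ablation is ordinary cross-entropy on cluster IDs, so the two arms of the proposed experiment differ only in whether the target retains mass on competing components.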

Figures

Figures reproduced from arXiv: 2606.19398 by Aaron Elkins, Adrian Kieback, Aman Chadha, Georgios Ioannides, Judah Goldfeder, Linsey Pang, Ravid Shwartz-Ziv, Yann LeCun.

Figure 1. S-JEPA: a JEPA-style encoder–predictor pair matches soft GMM posteriors at masked positions, with the GMM fit in Phase 1 (frozen, over MFCC features) and updated online in Phase 2 (over EMA encoder features). The single training signal is KL divergence at masked positions; the predictor, cluster head, and GMM are discarded after pre-training.
Figure 2. Per-layer effective rank across both phases of training. The horizontal axis shows training step …
Figure 3. S-JEPA advances the Pareto frontier on SUPERB below 90M parameters. Parameter count vs. performance on (a) ASR (WER on LibriSpeech test-clean, ↓) and (b) emotion recognition (accuracy, ↑). Dashed line: Pareto frontier across all evaluated SSL methods. Shaded region: S-JEPA’s improvement over the prior best in the 51.8M–90M range. S-JEPA reaches 12.10% WER (vs. 13.02% for DeCoAR 2.0 at 89.8M) and 64.83% ER …
Figure 4. The soft-target objective produces structured boundary uncertainty, not diffuse fuzziness. (a) On a single utterance, rank-1 posteriors sit near 1.0 within phonetically stable regions and drop to the 0.3–0.45 range near word boundaries, with rank-2 rising to compete. (b) The same pattern at the population level: a bimodal entropy distribution with a confident regime below 0.3 bits and a clear secondary mode at …
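The per-layer effective rank plotted in Figure 2 can be illustrated with a short sketch. The page does not give the paper's exact definition, so this assumes the RankMe-style quantity (the exponential of the entropy of the normalized singular values, as in reference [16]); the function name `effective_rank` is hypothetical.

```python
import numpy as np

def effective_rank(feats, eps=1e-7):
    """RankMe-style effective rank of a (N, D) feature matrix (assumed definition)."""
    s = np.linalg.svd(feats, compute_uv=False)       # singular values
    p = s / (s.sum() + eps) + eps                    # normalized spectrum
    return float(np.exp(-(p * np.log(p)).sum()))     # exp of spectral entropy

rng = np.random.default_rng(0)
collapsed = np.outer(rng.normal(size=200), rng.normal(size=16))   # rank-1 features
spread = rng.normal(size=(200, 16))                               # near full rank

r_collapsed = effective_rank(collapsed)   # close to 1
r_spread = effective_rank(spread)         # close to, but below, 16
```

Applied to the hidden states of each transformer layer on a batch of frames, a quantity like this needs no labels, which is the property the adaptive layer selection relies on.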
Original abstract

Self-supervised speech encoders are predominantly trained by predicting discrete hard cluster IDs at masked positions, a recipe that collapses acoustic ambiguity at category boundaries and requires interrupting training to re-cluster the entire corpus between iterations. We introduce S-JEPA, a JEPA-style encoder-predictor pair trained to match the soft posteriors of a Gaussian Mixture Model at masked positions via KL divergence. Training runs as one continuous optimization trajectory in two phases: a fixed GMM over MFCC features, then an online GMM over encoder features, with the input layer selected adaptively from a label-free signal, removing both the offline re-cluster step and the hand-tuned choice of which transformer layer to cluster on. Under the SUPERB protocol, S-JEPA achieves the lowest WER among evaluated SSL methods below 90M parameters and matches HuBERT-Base on emotion recognition at roughly half its parameter count, establishing a new Pareto frontier without offline re-clustering or teacher distillation. An analysis of the predictor's per-frame entropy on held-out speech reveals a bimodal distribution with a substantial minority of frames near the entropy of a perfect two-cluster tie, providing direct empirical evidence that the soft-target objective preserves the acoustic ambiguity that hard targets would collapse. Code is available at https://github.com/gioannides/s-jepa.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces S-JEPA, a JEPA-style self-supervised speech encoder trained to predict soft GMM posteriors at masked positions via KL divergence rather than hard cluster IDs. Training proceeds in one continuous trajectory using a fixed MFCC GMM followed by an online GMM on encoder features, with adaptive label-free layer selection; this removes offline re-clustering and hand-tuned layer choice. Under SUPERB, the model reports the lowest WER among SSL methods below 90M parameters and matches HuBERT-Base on emotion recognition at roughly half the parameter count, supported by an entropy analysis showing preserved acoustic ambiguity.

Significance. If the central claims hold, the work offers a simpler, more efficient SSL recipe for speech that preserves boundary ambiguity without re-clustering or distillation, potentially improving the Pareto frontier for models under 90M parameters. The public code release is a clear strength for reproducibility.

major comments (3)
  1. [Method (two-phase GMM and online phase)] Method section on two-phase GMM: the headline SUPERB claims (lowest WER below 90M params, matching HuBERT-Base on emotion) rest on the online GMM phase producing stable, non-collapsed soft posteriors that differ from hard assignments; no convergence diagnostics, initialization sensitivity, or data-order ablation is reported to substantiate this.
  2. [Method (adaptive layer selection)] Adaptive layer selection paragraph: the label-free signal used to choose the input layer for the GMM is described at high level only; without evidence that the selected posteriors remain non-degenerate across acoustic conditions, the KL objective risks reducing to standard hard clustering and the claimed removal of hand-tuning yields no advantage.
  3. [Experiments (SUPERB evaluation)] Results section (SUPERB tables): the Pareto-frontier claim requires variance estimates or statistical tests across runs; single-point WER and emotion scores without these cannot reliably establish superiority over baselines under 90M parameters.
minor comments (2)
  1. [Abstract and entropy analysis] Abstract and §4: the entropy analysis is presented as direct evidence of ambiguity preservation, but the precise definition of the 'perfect two-cluster tie' baseline entropy should be stated explicitly for reproducibility.
  2. [Method equations] Notation: the distinction between the fixed MFCC GMM and the online encoder GMM should be denoted with distinct symbols to avoid reader confusion in the equations.

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for the constructive feedback. We respond point-by-point to the major comments below.

  1. Referee: [Method (two-phase GMM and online phase)] Method section on two-phase GMM: the headline SUPERB claims (lowest WER below 90M params, matching HuBERT-Base on emotion) rest on the online GMM phase producing stable, non-collapsed soft posteriors that differ from hard assignments; no convergence diagnostics, initialization sensitivity, or data-order ablation is reported to substantiate this.

    Authors: We agree that convergence diagnostics would strengthen the claims regarding the online GMM. In the revised manuscript we will add plots of GMM log-likelihood and posterior entropy over training steps to demonstrate stability. Initialization sensitivity and data-order ablation were not performed owing to compute limits; we will note this limitation explicitly. revision: partial

  2. Referee: [Method (adaptive layer selection)] Adaptive layer selection paragraph: the label-free signal used to choose the input layer for the GMM is described at high level only; without evidence that the selected posteriors remain non-degenerate across acoustic conditions, the KL objective risks reducing to standard hard clustering and the claimed removal of hand-tuning yields no advantage.

    Authors: We will expand the description of the label-free selection criterion with additional implementation details. We will also add an appendix analysis of posterior entropy on held-out data from varied acoustic conditions to confirm the selected posteriors remain non-degenerate. revision: yes

  3. Referee: [Experiments (SUPERB evaluation)] Results section (SUPERB tables): the Pareto-frontier claim requires variance estimates or statistical tests across runs; single-point WER and emotion scores without these cannot reliably establish superiority over baselines under 90M parameters.

    Authors: Single-run reporting follows the established practice in the SSL speech literature due to training cost. We will revise the text to acknowledge this limitation and refrain from claiming statistical superiority. revision: partial

standing simulated objections not resolved
  • Variance estimates or statistical tests across multiple independent runs, which would require new full-scale training experiments not performed in the original work.
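Short of retraining with several seeds, a paired bootstrap over test utterances gives at least a confidence interval on a WER difference between two fixed systems. The sketch below illustrates that technique only; it does not measure run-to-run training variance, which is what the standing objection asks for. The function name `bootstrap_wer_diff` is hypothetical and the demo inputs are synthetic placeholders, not results from the paper.

```python
import numpy as np

def bootstrap_wer_diff(err_a, err_b, n_words, n_boot=2000, seed=0):
    """Paired bootstrap over utterances for WER(A) - WER(B).

    err_a, err_b: per-utterance word-error counts for systems A and B.
    n_words: per-utterance reference word counts.
    Returns (mean difference, 2.5th percentile, 97.5th percentile).
    """
    rng = np.random.default_rng(seed)
    n = len(n_words)
    diffs = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)             # resample utterances with replacement
        diffs[b] = (err_a[idx].sum() - err_b[idx].sum()) / n_words[idx].sum()
    lo, hi = np.percentile(diffs, [2.5, 97.5])
    return diffs.mean(), lo, hi

# Synthetic placeholder data, for illustration only.
rng = np.random.default_rng(1)
n_words = rng.integers(5, 40, size=500)
err_b = rng.binomial(n_words, 0.13)
err_a = rng.binomial(n_words, 0.12)
mean_diff, ci_lo, ci_hi = bootstrap_wer_diff(err_a, err_b, n_words)
```

If the interval excludes zero, the gap is unlikely to be an artifact of which utterances happen to be in the test set; it says nothing about seed sensitivity.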

Circularity Check

0 steps flagged

No significant circularity; performance claims rest on external SUPERB benchmarks

full rationale

The paper introduces S-JEPA as a JEPA-style model trained via KL divergence to soft GMM posteriors in two phases (fixed MFCC then online encoder features) with adaptive layer selection. Its headline results (lowest WER below 90M params, matching HuBERT-Base on emotion at half size) are measured directly against the independent SUPERB protocol using standard downstream metrics. No equations, fitted parameters, or self-citations are shown to reduce the reported performance numbers to quantities defined solely by the method's own inputs or prior author work. The per-frame entropy analysis on held-out data is presented as separate empirical evidence. This is the common case of an empirical method whose central claims remain falsifiable against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

Relies on standard mathematical properties of KL divergence and GMM modeling plus domain assumptions of SSL for speech; introduces no new postulated entities.

free parameters (1)
  • GMM component count
    Number of mixture components in both fixed and online GMMs is a modeling choice that affects the soft posterior granularity.
axioms (2)
  • standard math: KL divergence is a suitable objective for matching soft cluster distributions
    Invoked to train the predictor to match GMM posteriors at masked positions.
  • domain assumption: Online GMM updates on encoder features remain stable without periodic resets
    Central to the continuous training claim versus hard clustering methods.
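The stability assumption in the ledger above concerns an online GMM that is updated from mini-batches rather than refit offline. One common way to do that is stepwise EM with exponentially decayed sufficient statistics, sketched below. This is an assumption about how such an update could work, not the paper's procedure: the class `OnlineDiagGMM`, the `decay` rate, and the variance floor are all hypothetical.

```python
import numpy as np

class OnlineDiagGMM:
    """Diagonal GMM updated by stepwise EM on mini-batches (illustrative sketch)."""

    def __init__(self, means, variances, weights, decay=0.99, var_floor=1e-3):
        self.means = np.array(means, dtype=float)
        self.variances = np.array(variances, dtype=float)
        self.weights = np.array(weights, dtype=float)
        self.decay, self.var_floor = decay, var_floor
        # Running sufficient statistics, initialised from the starting parameters.
        self.s0 = self.weights.copy()                                   # (K,)
        self.s1 = self.weights[:, None] * self.means                    # (K, D)
        self.s2 = self.weights[:, None] * (self.variances + self.means ** 2)

    def posteriors(self, x):
        diff = x[:, None, :] - self.means[None]
        log_pdf = -0.5 * (np.log(2 * np.pi * self.variances)[None]
                          + diff ** 2 / self.variances[None]).sum(axis=-1)
        lj = np.log(self.weights + 1e-12)[None] + log_pdf               # (T, K)
        lj -= lj.max(axis=1, keepdims=True)
        p = np.exp(lj)
        return p / p.sum(axis=1, keepdims=True)

    def update(self, x):
        r = self.posteriors(x)                                          # E-step
        T, d = x.shape[0], self.decay
        # Blend batch statistics into the running averages.
        self.s0 = d * self.s0 + (1 - d) * r.mean(axis=0)
        self.s1 = d * self.s1 + (1 - d) * (r.T @ x) / T
        self.s2 = d * self.s2 + (1 - d) * (r.T @ x ** 2) / T
        # M-step from the running statistics.
        self.weights = self.s0 / self.s0.sum()
        self.means = self.s1 / (self.s0[:, None] + 1e-12)
        var = self.s2 / (self.s0[:, None] + 1e-12) - self.means ** 2
        self.variances = np.maximum(var, self.var_floor)
        return r                                                        # soft targets
```

The posterior entropy of the returned `r` and the spread of `weights` over training steps are the kind of diagnostics the referee's first comment asks for: a component whose weight decays toward zero, or posteriors that become uniformly one-hot, would signal the collapse the axiom rules out.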

pith-pipeline@v0.9.1-grok · 5792 in / 1401 out tokens · 32770 ms · 2026-06-26T19:44:37.057509+00:00 · methodology


Reference graph

Works this paper leans on

50 extracted references · 6 linked inside Pith

  1. [1]

    Mahmoud Assran, Quentin Duval, Ishan Misra, Piotr Bojanowski, Pascal Vincent, Michael Rabbat, Yann LeCun, and Nicolas Ballas. Self-supervised learning from images with a joint-embedding predictive architecture. In CVPR, 2023

  2. [2]

    Alexei Baevski, Steffen Schneider, and Michael Auli. vq-wav2vec: Self-supervised learning of discrete speech representations. In ICLR, 2020

  3. [3]

    Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli. wav2vec 2.0: A framework for self-supervised learning of speech representations. In NeurIPS, 2020

  4. [4]

    Alexei Baevski, Wei-Ning Hsu, Qiantong Xu, Arun Babu, Jiatao Gu, and Michael Auli. data2vec: A general framework for self-supervised learning in speech, vision and language. In ICML, 2022

  5. [5]

    Randall Balestriero and Yann LeCun. Lejepa: Provable and scalable self-supervised learning without the heuristics. arXiv preprint arXiv:2511.08544, 2025

  6. [6]

    Christopher M Bishop. Pattern Recognition and Machine Learning. Springer, 2006

  7. [7]

    Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In ICCV, 2021

  8. [8]

    Heng-Jui Chang, Shu-wen Yang, and Hung-yi Lee. Distilhubert: Speech representation learning by layer-wise distillation of hidden-unit bert. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 7087–

  9. [9]

    Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, et al. Wavlm: Large-scale self-supervised pre-training for full stack speech processing. In IEEE JSTSP, 2022

  10. [10]

    Chung-Cheng Chiu, James Qin, Yu Zhang, Jiahui Yu, and Yonghui Wu. Self-supervised learning with random-projection quantizer for speech recognition. In ICML, 2022

  11. [11]

    Yu-An Chung, Wei-Ning Hsu, Hao Tang, and James Glass. An unsupervised autoregressive model for speech representation learning. In INTERSPEECH, 2019

  12. [12]

    Yu-An Chung, Hao Tang, and James Glass. Vector-quantized autoregressive predictive coding. In INTERSPEECH, 2020

  13. [13]

    Yu-An Chung, Yu Zhang, Wei Han, Chung-Cheng Chiu, James Qin, Ruoming Pang, and Yonghui Wu. w2v-bert: Combining contrastive learning and masked language modeling for self-supervised speech pre-training. In ASRU, 2021

  14. [14]

    Arthur P Dempster, Nan M Laird, and Donald B Rubin. Maximum likelihood from incomplete data via the em algorithm. Journal of the Royal Statistical Society: Series B, 1977

  15. [15]

    Zhengcong Fei, Mingyuan Fan, and Junshi Huang. A-jepa: Joint-embedding predictive architecture can listen. 2024. URL https://arxiv.org/abs/2311.15830

  16. [16]

    Quentin Garrido, Randall Balestriero, Laurent Najman, and Yann LeCun. Rankme: Assessing the downstream performance of pretrained self-supervised representations by their rank. In International conference on machine learning, pages 10929–10974. PMLR, 2023

  17. [17]

    Judah Goldfeder, Philippe Wyder, Yann LeCun, and Ravid Shwartz Ziv. Ai must embrace specialization via superhuman adaptable intelligence. arXiv preprint arXiv:2602.23643, 2026

  18. [18]

    Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Guo, Mohammad Gheshlaghi Azar, et al. Bootstrap your own latent: A new approach to self-supervised learning. In NeurIPS, 2020

  19. [19]

    Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015

  20. [20]

    Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, and Abdelrahman Mohamed. Hubert: Self-supervised speech representation learning by masked prediction of hidden units. In IEEE/ACM TASLP, 2021

  21. [21]

    Georgios Ioannides, Adrian Kieback, Judah Goldfeder, Linsey Pang, Aman Chadha, Aaron Elkins, Yann LeCun, and Ravid Shwartz-Ziv. Soft clustering anchors for self-supervised speech representation learning in joint embedding prediction architectures. arXiv preprint arXiv:2602.09040, 2026

  22. [22]

    Jacob Kahn, Morgane Rivière, Weiyi Zheng, Eugene Kharitonov, Qiantong Xu, Pierre-Emmanuel Mazaré, Julien Karadayi, Vitaliy Lber, Steffen Schneider, Emmanuel Dupoux, and Gabriel Synnaeve. Libri-light: A benchmark for asr with limited or no supervision. In ICASSP, 2020

  23. [23]

    Nithin Rao Koluguri, Monica Sekoyan, George Zelenfroynd, Sasha Meister, Shuoyang Ding, Sofia Kostandian, He Huang, Nikolay Karpov, Jagadeesh Balam, Vitaly Lavrukhin, Yifan Peng, Sara Papi, Marco Gaido, Alessio Brutti, and Boris Ginsburg. Granary: Speech recognition and translation dataset in 25 european languages, 2025. URL https://arxiv.org/abs/2505.13404

  24. [24]

    Peter Ladefoged and Keith Johnson. A Course in Phonetics. Cengage Learning, 7th edition, 2014

  25. [25]

    Yann LeCun. A path towards autonomous machine intelligence. OpenReview, 2022

  26. [26]

    Shaoshi Ling and Yuzong Liu. Decoar 2.0: Deep contextualized acoustic representations with vector quantization. arXiv preprint arXiv:2012.06659, 2020

  27. [27]

    Alexander H. Liu, Yu-An Chung, and James Glass. Non-autoregressive predictive coding for learning speech representations from local dependencies. 2020. URL https://arxiv.org/abs/2011.00406

  28. [28]

    Andy T. Liu, Shang-Wen Yang, Po-Han Chi, Po-Chun Hsu, and Hung-yi Lee. Mockingjay: Unsupervised speech representation learning with deep bidirectional transformer encoders. In ICASSP, 2020

  29. [29]

    Andy T. Liu et al. Tera: Self-supervised learning of transformer encoder representation for speech. IEEE/ACM TASLP, 2021

  30. [30]

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. ICLR, 2019

  31. [31]

    James MacQueen. Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, 1967

  32. [32]

    Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. Librispeech: An asr corpus based on public domain audio books. In ICASSP, 2015

  33. [33]

    Santiago Pascual, Mirco Ravanelli, Joan Serra, Antonio Bonafonte, and Yoshua Bengio. Learning problem-agnostic speech representations from multiple self-supervised tasks. In INTERSPEECH, 2019

  34. [34]

    Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. Robust speech recognition via large-scale weak supervision. URL https://arxiv.org/abs/2212.04356

  35. [35]

    Steffen Schneider, Alexei Baevski, Ronan Collobert, and Michael Auli. wav2vec: Unsupervised pre-training for speech recognition. In Interspeech, 2019

  36. [36]

    Oscar Skean, Md Rifat Arefin, Dan Zhao, Niket Patel, Jalal Naghiyev, Yann LeCun, and Ravid Shwartz-Ziv. Layer by layer: Uncovering hidden representations in language models. arXiv preprint arXiv:2502.02013, 2025

  37. [37]

    Kenneth N. Stevens. Acoustic Phonetics. MIT Press, 1998

  38. [38]

    Naftali Tishby, Fernando C Pereira, and William Bialek. The information bottleneck method. arXiv preprint physics/0004057, 2000

  39. [39]

    Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018

  40. [40]

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017

  41. [41]

    Jeffrey S. Vitter. Random sampling with a reservoir. ACM Transactions on Mathematical Software, 11(1):37–57, 1985

  42. [42]

    Shu-wen Yang, Po-Han Chi, Yung-Sung Chuang, Cheng-I Jeff Lai, Kushal Lakhotia, Yist Y. Lin, Andy T. Liu, Jiatong Shi, Xuankai Chang, Guan-Ting Lin, Tzu-Hsien Huang, Wei-Cheng Tseng, Ko-tik Lee, Da-Rong Liu, Zili Huang, Shuyan Dong, Shang-Wen Li, Shinji Watanabe, Abdelrahman Mohamed, and Hung-yi Lee. Superb: Speech processing universal performance bench...

  43. [43]

    Dropping the visible-position loss. Visible-position cluster predictions are an easier prediction problem (the encoder sees the input directly there) and may supply gradients that are largely redundant with what the masked-position loss already provides. Dropping the visible-position term (S = M) is consistent with the masked-only configuration used by data2vec [4], wav2vec 2.0 [3], and HuBERT [20] ...

  44. [44]

    Disabling augmentation. We set p_noise = p_mix = 0, so the encoder sees the same clean waveform that the EMA encoder feeds to the GMM. This is in tension with WavLM’s [9] finding that denoising augmentation is helpful, and likely reflects an interaction specific to our setup: if the GMM target is computed from clean audio while the encoder sees augmented aud...

  45. [45]

    Sample mix length L_mix as a random fraction of the primary length

  46. [46]

    Sample start positions t_1, t_2 in the primary and secondary signals

  47. [47]

    Extract regions r_1 ← x_1[t_1 : t_1 + L], r_2 ← x_2[t_2 : t_2 + L] and compute energies E_1, E_2

  48. [48]

    Sample energy ratio ρ ∼ Uniform over a fixed dB range

  49. [49]

    Compute β and mix x_1[t_1 : t_1 + L] ← r_1 + β·r_2.

    J SUPERB Evaluation Protocol. We follow the standard SUPERB [42] protocol: the pre-trained encoder is frozen and small task-specific heads are trained on top. Across tasks, the encoder receives raw waveform input and the task-specific head consumes its frame-level outputs (or a learned weighted combination across ...
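The utterance-mixing steps listed just before the SUPERB protocol text (sample a mix length, sample start positions, extract regions and energies, sample an energy ratio in dB, scale and add) can be sketched as follows. Several details are assumptions because the extracted text omits them: the dB range, the maximum mix fraction, the use of mean-square energy, and the formula for β, chosen here so that the primary-to-secondary energy ratio in the mixed region equals ρ dB. The function name `mix_utterance` is hypothetical. Note that the extracted text reports the final configuration with this augmentation disabled (p_mix = 0).

```python
import numpy as np

def mix_utterance(x1, x2, rng, db_range=(-5.0, 5.0), max_frac=0.5):
    """Mix a region of secondary signal x2 into primary signal x1 (illustrative)."""
    x1 = x1.copy()
    # 1. Mix length as a random fraction of the primary length (max_frac assumed).
    L = int(rng.uniform(0.0, max_frac) * len(x1))
    L = max(1, min(L, len(x2)))
    # 2. Start positions in the primary and secondary signals.
    t1 = int(rng.integers(0, len(x1) - L + 1))
    t2 = int(rng.integers(0, len(x2) - L + 1))
    # 3. Extract regions and compute their mean-square energies.
    r1, r2 = x1[t1:t1 + L], x2[t2:t2 + L]
    E1, E2 = float(np.mean(r1 ** 2)), float(np.mean(r2 ** 2))
    # 4. Energy ratio in dB, uniform over a fixed range (range assumed).
    rho = rng.uniform(*db_range)
    # 5. Scale so that E1 / (beta^2 * E2) = 10^(rho / 10), then mix in place.
    beta = np.sqrt(E1 / (E2 * 10 ** (rho / 10) + 1e-12))
    x1[t1:t1 + L] = r1 + beta * r2
    return x1, (t1, L)

rng = np.random.default_rng(0)
primary = rng.normal(size=16000)      # stand-in for a 1 s waveform at 16 kHz
secondary = rng.normal(size=8000)
mixed, (start, length) = mix_utterance(primary, secondary, rng)
```

Samples outside the mixed region are left untouched, so the GMM targets for those frames would still correspond to clean audio.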