
arxiv: 2606.30356 · v1 · pith:XS7TPDSEnew · submitted 2026-06-29 · 💻 cs.CL · cs.LG · cs.SD · eess.AS

OLIVE: View-Augmented Latent Prediction with Waveform Reconstruction for Speech SSL

Pith reviewed 2026-06-30 06:15 UTC · model grok-4.3

classification 💻 cs.CL · cs.LG · cs.SD · eess.AS
keywords self-supervised learning · speech representation learning · waveform reconstruction · masked latent prediction · view augmentation · generation tasks · speaker tasks · recognition tasks

The pith

OLIVE jointly optimizes view-augmented masked latent prediction with waveform reconstruction to produce speech representations that support both generation and recognition tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces OLIVE, a self-supervised speech representation learning method that combines view-augmented masked latent prediction with waveform reconstruction in one objective. Reconstruction is meant to keep early encoder layers focused on signal details, while the prediction term pushes later layers toward invariant contextual features. The central goal is a representation that handles a wide set of downstream tasks without the usual performance trade-off between analysis and synthesis. A sympathetic reader would care because the approach suggests that a single pretraining stage could yield models useful for generation, speaker, recognition, and semantic tasks.

Core claim

OLIVE combines view-augmented masked latent prediction with waveform reconstruction under a unified objective. Reconstruction constrains early encoder features to retain signal-level information, while masked latent prediction shapes later contextual representations toward invariance. This enables representations that support a broad range of tasks, improving results on generation and speaker tasks, maintaining competitive performance on recognition and semantic tasks, and improving waveform reconstruction.

What carries the argument

The unified objective of view-augmented masked latent prediction and waveform reconstruction, which assigns reconstruction the role of preserving signal details in early features and prediction the role of creating invariance in later features.
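
A minimal sketch of how such a unified objective could be assembled, assuming a mean-squared-error prediction term computed over masked frames only, a mean-absolute-error waveform term, and a scalar weight `recon_weight`. The loss forms, the weight, and all function names are illustrative assumptions; the abstract only states that the two terms are combined.

```python
import numpy as np


def masked_latent_loss(student_pred, teacher_target, mask):
    """Mean squared error between student and teacher latents on masked frames.

    student_pred, teacher_target: arrays of shape (frames, dim).
    mask: boolean array of shape (frames,), True where the input was masked.
    The MSE form is an assumption, not taken from the paper.
    """
    if not mask.any():
        raise ValueError("mask must select at least one frame")
    diff = student_pred[mask] - teacher_target[mask]
    return float(np.mean(diff ** 2))


def reconstruction_loss(waveform_hat, waveform):
    """Mean absolute error between reconstructed and reference waveforms.

    The L1 form is an assumption; a vocoder would normally add further terms.
    """
    return float(np.mean(np.abs(waveform_hat - waveform)))


def joint_objective(student_pred, teacher_target, mask,
                    waveform_hat, waveform, recon_weight=1.0):
    """Weighted sum of the prediction and reconstruction terms.

    recon_weight is a hypothetical balancing hyperparameter.
    """
    pred = masked_latent_loss(student_pred, teacher_target, mask)
    rec = reconstruction_loss(waveform_hat, waveform)
    return pred + recon_weight * rec
```

In this sketch the prediction term only sees masked frames, so unmasked frames contribute to training solely through the reconstruction term.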

If this is right

  • Generation tasks show improved results
  • Speaker tasks show improved results
  • Recognition and semantic tasks remain competitive
  • Waveform reconstruction quality itself improves

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same joint-objective pattern might allow a single pretrained model to serve both analysis and synthesis speech applications without separate fine-tuning paths.
  • Similar dual-objective designs could be tested in other audio domains such as music or environmental sound processing.
  • The separation of early and late layer roles might generalize to other self-supervised frameworks that currently optimize only one type of objective.

Load-bearing premise

The reconstruction objective will constrain early encoder features to retain signal-level information while the masked prediction objective shapes later contextual representations toward invariance, without the two objectives interfering or requiring extensive hyperparameter tuning to balance.

What would settle it

An ablation that removes the waveform reconstruction term would falsify the claim that the joint objective yields complementary, non-interfering features if it either improved recognition performance or left generation and speaker performance no worse than under the full objective.
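
The falsification condition above can be written as a small decision rule. This is a sketch under stated assumptions: scores are treated as higher-is-better, a shortfall on either generation or speaker tasks counts as a failure, and the function name, dictionary keys, and `tolerance` parameter are hypothetical.

```python
def joint_objective_falsified(with_rec, without_rec, tolerance=0.0):
    """Return True if an ablation contradicts the complementary-features claim.

    with_rec / without_rec: dicts with keys 'recognition', 'generation',
    'speaker', holding higher-is-better scores for the full model and for the
    model trained without the reconstruction term (assumed convention).
    """
    # Condition 1: dropping reconstruction improves recognition.
    recognition_better_without = (
        without_rec["recognition"] > with_rec["recognition"] + tolerance
    )
    # Condition 2: reconstruction brings no gain on generation or speaker tasks.
    no_gain = any(
        with_rec[task] <= without_rec[task] + tolerance
        for task in ("generation", "speaker")
    )
    return recognition_better_without or no_gain
```

A non-zero `tolerance` would let small differences within run-to-run noise count as ties rather than as evidence either way.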

Figures

Figures reproduced from arXiv: 2606.30356 by Karl El Hajal, Mathew Magimai.-Doss.

Figure 1: OLIVE pre-training framework. Two independently augmented waveform views are passed through a shared local feature extractor. The analysis branch performs view-augmented masked distillation: the student predicts contextual teacher targets from masked input features. The synthesis branch performs waveform reconstruction by conditioning a neural vocoder on the local student features. The final objective comb…
Figure 2: SUPERB radar comparisons. Left and center: normalized per-task profiles. Right: aggregate category-level comparison. Per-task metrics are normalized so that FBANK is 0 and the best value in Tables 1 and 2 is 1000; multi-metric tasks are averaged within task.
Figure 3: Representative SUPERB layer-combination weights. Automatic speech recognition (ASR) and slot filling (SF) emphasize later contextual layers, while automatic speaker verification (ASV) and speech enhancement (SE) place more weight on earlier layers.
Figure 4: Spectrograms for three reference utterances and reconstructions from the frozen feature…
Figure 5: Spectrograms for one reference utterance and reconstructions from HiFi-GAN V2 vocoders…
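
The per-task normalization described in the Figure 2 caption can be sketched as follows. The caption fixes only the two endpoints (FBANK at 0, best value at 1000); the linear interpolation between them, the function names, and the example numbers in the usage note are assumptions.

```python
def normalize_score(value, fbank, best):
    """Map a raw metric onto a scale where FBANK -> 0 and the best value -> 1000.

    Linear interpolation is assumed. Because the sign of (best - fbank)
    carries the metric direction, the same formula covers both
    higher-is-better and lower-is-better metrics.
    """
    if best == fbank:
        raise ValueError("best and FBANK values must differ")
    return 1000.0 * (value - fbank) / (best - fbank)


def task_score(values, fbanks, bests):
    """Average the normalized scores of all metrics belonging to one task."""
    scores = [normalize_score(v, f, b) for v, f, b in zip(values, fbanks, bests)]
    return sum(scores) / len(scores)
```

For instance, with made-up numbers, an accuracy of 50 between an FBANK value of 10 and a best value of 90 maps to 500, as does an error rate of 13 between an FBANK value of 23 and a best value of 3.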
Original abstract

We propose Online Latent prediction with Invariant Views and rEconstruction (OLIVE), a self-supervised speech representation learning framework that jointly optimizes analysis and synthesis objectives. OLIVE combines view-augmented masked latent prediction with waveform reconstruction under a unified objective. Reconstruction constrains early encoder features to retain signal-level information, while masked latent prediction shapes later contextual representations toward invariance for robust downstream performance. We show that these objectives enable representations that support a broad range of tasks. In particular, OLIVE improves results on generation and speaker tasks, maintains competitive performance on recognition and semantic tasks, and improves waveform reconstruction.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 1 minor

Summary. The paper proposes OLIVE, a self-supervised speech representation learning framework that jointly optimizes view-augmented masked latent prediction and waveform reconstruction under a unified objective. Reconstruction is intended to constrain early encoder features to retain signal-level information, while masked latent prediction shapes later contextual representations toward invariance. The authors claim that this enables representations supporting a broad range of tasks, specifically improving results on generation and speaker tasks, maintaining competitive performance on recognition and semantic tasks, and improving waveform reconstruction.

Significance. If the empirical claims hold, the framework provides a principled way to balance local signal fidelity with contextual invariance in speech SSL without requiring separate models or extensive balancing of objectives. This could yield more versatile representations applicable across generation, speaker, recognition, and semantic tasks.

minor comments (1)
  1. [Abstract] The abstract asserts specific performance improvements and a mechanistic separation of roles between the two objectives, but supplies no quantitative results, baselines, ablation studies, error bars, or layer-wise diagnostics to support these claims.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary of our work, the assessment of its potential significance, and the recommendation for minor revision. We are glad that the unified objective and its intended effects on early and later layers are viewed as a principled contribution.

Circularity Check

0 steps flagged

No significant circularity detected


The provided abstract and description contain no equations, parameter fits, self-citations, or derivation steps that reduce a claimed prediction or result to its own inputs by construction. The framework is described as jointly optimizing two objectives with intended separation of roles for early vs. later layers, but this is presented as a design choice without any mathematical reduction, uniqueness theorem, or fitted-input-as-prediction pattern. No load-bearing self-referential elements appear, so the derivation chain (such as it is) remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Review performed on abstract only; no mathematical structure, free parameters, or explicit axioms are stated in the provided text.

axioms (1)
  • Domain assumption: joint optimization of reconstruction and masked prediction will produce representations that simultaneously retain signal detail and achieve invariance without destructive interference.
    Implicit in the abstract's mechanistic description of how the two objectives affect early versus later layers.

pith-pipeline@v0.9.1-grok · 5633 in / 1158 out tokens · 15775 ms · 2026-06-30T06:15:07.735009+00:00 · methodology

