arXiv preprint arXiv:2105.01051
17 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
roles: background 1
citation-polarity summary
polarities: background 1
citing papers explorer
- S-JEPA: Soft Clustering Anchors for Self-Supervised Speech Representation Learning
S-JEPA uses soft GMM posteriors in a JEPA framework for self-supervised speech learning, achieving the lowest WER among models below 90M parameters without offline re-clustering.
- Multi-layer attentive probing improves transfer of audio representations for bioacoustics
Multi-layer attentive probing outperforms last-layer linear probing for transferring audio representations to bioacoustic tasks, indicating that standard evaluation setups may underestimate model quality.
- MSU-Bench: Towards Speaker-Centric Understanding in Conversational Multi-Speaker Scenarios
MSU-Bench is a new two-tier benchmark covering speaker grounding to dialogue reasoning in multi-speaker conversations, with Gemini-assisted annotation and human verification.
- End-to-End Voice Intent Recognition for Spontaneous Human-Drone Interaction with Naive Users
An end-to-end SLU architecture with frozen SSL acoustic encoder, LSTM classification head, and cross-modal distillation achieves 93% accuracy on simple commands and 82% on spontaneous speech at 7 ms latency on the new VoiceStick corpus, outperforming cascade baselines.
- SEAM: Shortcut-Aware Real-Time Detection of Scripted vs. Spontaneous Speech for Interview Guardrails
SEAM achieves 0.971 ROC-AUC on external interview data for real-time scripted speech detection by combining shortcut-prevention data techniques with a compact audio backbone.
- AudioMosaic: Contrastive Masked Audio Representation Learning
AudioMosaic learns general-purpose audio representations through contrastive pre-training with structured spectrogram masking, reaching state-of-the-art results on standard benchmarks and improving audio-language tasks.
- Alethia: A Foundational Encoder for Voice Deepfakes
Alethia is a pretrained audio encoder using continuous embedding prediction and generative flow-matching reconstruction that outperforms existing speech foundation models on voice deepfake tasks with better robustness and zero-shot generalization.
- StressTest: Can YOUR Speech LM Handle the Stress?
Speech language models fail at reasoning about sentence stress but improve after fine-tuning on a new 17k-example synthetic dataset that varies stress to alter meaning.
- SIGMA: Saliency-Guided Sparse Mask Attacks for Speech Emotion Recognition
SIGMA applies post-hoc XAI saliency maps to define reusable sparse masks for magnitude-bounded perturbations on self-supervised speech features, evaluated on IEMOCAP and TESS for competitive attack success with explanation consistency trade-offs.
- Impact Analysis of Speech Representation Learning Models for Acoustic Side-Channel Attack
The KEYAC dataset benchmarks speech models for keyboard acoustic side-channel attacks, with KAN fine-tuning setting a new SOTA by addressing nonlinear feature interactions.
- A Unified and Reproducible Experimentation Framework for Speech Understanding
SURE is a new standardized framework for evaluating and training speech foundation models and Speech LLMs to improve comparability and reproducibility under realistic conditions.
- ULTRAS -- Unified Learning of Transformer Representations for Audio and Speech Signals
ULTRAS unifies audio and speech representation learning in a single transformer by applying patch masking to log-mel spectrograms and using a joint spectral-temporal prediction loss.
- Joint Learning using Mixture-of-Expert-Based Representation for Speech Enhancement and Robust Emotion Recognition
Sparse MERIT uses frame-wise sparse mixture-of-experts with task-specific gating on self-supervised speech features to jointly optimize enhancement and emotion recognition, reporting gains over baselines on MSP-Podcast at low SNR.
- AVEX: What Matters for Animal Vocalization Encoding
A large empirical study finds that self-supervised pre-training followed by supervised post-training on mixed bioacoustics and general audio data produces the strongest encoders across 26 datasets for species classification, detection, individual ID, and repertoire discovery.
- MOSS-Audio Technical Report
MOSS-Audio is an audio-language model using a 12.5 Hz encoder, DeepStack cross-layer injection, time markers, and an event-preserving annotation pipeline for unified audio understanding.
- From Objectives to Applications: Aligning Architectural Biases in Audio Self-Supervised Learning
A survey that organizes audio SSL into five objective paradigms, relates their demands to architectural biases, and interprets downstream applications as tests of generalization.
- Multilingual and Multimodal LLMs in the Wild: Building for Low-Resource Languages
A tutorial synthesizing foundations, recent models such as PALO and Maya, and low-cost methods for tri-modal multilingual AI in resource-constrained settings.