arXiv preprint arXiv:2105.01051 · 2021

17 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

S-JEPA : Soft Clustering Anchors for Self-Supervised Speech Representation Learning

cs.SD · 2026-06-17 · unverdicted · novelty 7.0

S-JEPA uses soft GMM posteriors in a JEPA framework for self-supervised speech learning, achieving the lowest WER among models below 90M parameters without offline re-clustering.

Multi-layer attentive probing improves transfer of audio representations for bioacoustics

cs.SD · 2026-05-11 · unverdicted · novelty 7.0

Multi-layer attentive probing outperforms last-layer linear probing for transferring audio representations to bioacoustic tasks, indicating that standard evaluation setups may underestimate model quality.

MSU-Bench: Towards Speaker-Centric Understanding in Conversational Multi-Speaker Scenarios

eess.AS · 2026-06-22 · unverdicted · novelty 6.0

MSU-Bench is a new two-tier benchmark covering speaker grounding to dialogue reasoning in multi-speaker conversations, with Gemini-assisted annotation and human verification.

End-to-End Voice Intent Recognition for Spontaneous Human-Drone Interaction with Naive Users

eess.AS · 2026-06-19 · unverdicted · novelty 6.0

An end-to-end SLU architecture with a frozen SSL acoustic encoder, an LSTM classification head, and cross-modal distillation achieves 93% accuracy on simple commands and 82% on spontaneous speech at 7 ms latency on the new VoiceStick corpus, outperforming cascade baselines.

SEAM: Shortcut-Aware Real-Time Detection of Scripted vs. Spontaneous Speech for Interview Guardrails

eess.AS · 2026-06-05 · conditional · novelty 6.0

SEAM achieves 0.971 ROC-AUC on external interview data for real-time scripted speech detection by combining shortcut-prevention data techniques with a compact audio backbone.

AudioMosaic: Contrastive Masked Audio Representation Learning

cs.LG · 2026-05-14 · unverdicted · novelty 6.0

AudioMosaic learns general-purpose audio representations through contrastive pre-training with structured spectrogram masking, reaching state-of-the-art results on standard benchmarks and improving audio-language tasks.

Alethia: A Foundational Encoder for Voice Deepfakes

cs.SD · 2026-04-30 · unverdicted · novelty 6.0

Alethia is a pretrained audio encoder using continuous embedding prediction and generative flow-matching reconstruction that outperforms existing speech foundation models on voice deepfake tasks with better robustness and zero-shot generalization.

StressTest: Can YOUR Speech LM Handle the Stress?

cs.CL · 2025-05-28 · conditional · novelty 6.0

Speech language models fail at reasoning about sentence stress but improve after fine-tuning on a new 17k-example synthetic dataset that varies stress to alter meaning.

SIGMA: Saliency-Guided Sparse Mask Attacks for Speech Emotion Recognition

cs.SD · 2026-06-29 · unverdicted · novelty 5.0

SIGMA applies post-hoc XAI saliency maps to define reusable sparse masks for magnitude-bounded perturbations on self-supervised speech features, evaluated on IEMOCAP and TESS for competitive attack success with explanation consistency trade-offs.

Impact Analysis of Speech Representation Learning Models for Acoustic Side-Channel Attack

cs.CR · 2026-06-19 · unverdicted · novelty 5.0 · 2 refs

The KEYAC dataset benchmarks speech models for keyboard acoustic side-channel attacks, with KAN fine-tuning setting a new SOTA by addressing nonlinear feature interactions.

A Unified and Reproducible Experimentation Framework for Speech Understanding

eess.AS · 2026-05-29 · unverdicted · novelty 5.0

SURE is a new standardized framework for evaluating and training speech foundation models and Speech LLMs to improve comparability and reproducibility under realistic conditions.

ULTRAS -- Unified Learning of Transformer Representations for Audio and Speech Signals

eess.AS · 2026-04-08 · unverdicted · novelty 5.0

ULTRAS unifies audio and speech representation learning in a single transformer by applying patch masking to log-mel spectrograms and using a joint spectral-temporal prediction loss.

Joint Learning using Mixture-of-Expert-Based Representation for Speech Enhancement and Robust Emotion Recognition

eess.AS · 2025-09-10 · unverdicted · novelty 5.0

Sparse MERIT uses frame-wise sparse mixture-of-experts with task-specific gating on self-supervised speech features to jointly optimize enhancement and emotion recognition, reporting gains over baselines on MSP-Podcast at low SNR.

AVEX: What Matters for Animal Vocalization Encoding

cs.SD · 2025-08-15 · unverdicted · novelty 5.0

A large empirical study finds that self-supervised pre-training followed by supervised post-training on mixed bioacoustics and general audio data produces the strongest encoders across 26 datasets for species classification, detection, individual ID, and repertoire discovery.

MOSS-Audio Technical Report

cs.SD · 2026-06-01 · unverdicted · novelty 4.0

MOSS-Audio is an audio-language model using a 12.5 Hz encoder, DeepStack cross-layer injection, time markers, and an event-preserving annotation pipeline for unified audio understanding.

From Objectives to Applications: Aligning Architectural Biases in Audio Self-Supervised Learning

eess.AS · 2026-07-01 · unverdicted · novelty 3.0

A survey that organizes audio SSL into five objective paradigms, relates their demands to architectural biases, and interprets downstream applications as tests of generalization.

Multilingual and Multimodal LLMs in the Wild: Building for Low-Resource Languages

cs.CL · 2026-05-16 · unverdicted · novelty 2.0

A tutorial synthesizing foundations, recent models such as PALO and Maya, and low-cost methods for tri-modal multilingual AI in resource-constrained settings.

citing papers explorer

Showing 17 of 17 citing papers.

S-JEPA : Soft Clustering Anchors for Self-Supervised Speech Representation Learning
cs.SD · 2026-06-17 · unverdicted · none · ref 43

Multi-layer attentive probing improves transfer of audio representations for bioacoustics
cs.SD · 2026-05-11 · unverdicted · none · ref 9

MSU-Bench: Towards Speaker-Centric Understanding in Conversational Multi-Speaker Scenarios
eess.AS · 2026-06-22 · unverdicted · none · ref 18

End-to-End Voice Intent Recognition for Spontaneous Human-Drone Interaction with Naive Users
eess.AS · 2026-06-19 · unverdicted · none · ref 31

SEAM: Shortcut-Aware Real-Time Detection of Scripted vs. Spontaneous Speech for Interview Guardrails
eess.AS · 2026-06-05 · conditional · none · ref 25

AudioMosaic: Contrastive Masked Audio Representation Learning
cs.LG · 2026-05-14 · unverdicted · none · ref 15

Alethia: A Foundational Encoder for Voice Deepfakes
cs.SD · 2026-04-30 · unverdicted · none · ref 45

StressTest: Can YOUR Speech LM Handle the Stress?
cs.CL · 2025-05-28 · conditional · none · ref 6

SIGMA: Saliency-Guided Sparse Mask Attacks for Speech Emotion Recognition
cs.SD · 2026-06-29 · unverdicted · none · ref 25

Impact Analysis of Speech Representation Learning Models for Acoustic Side-Channel Attack
cs.CR · 2026-06-19 · unverdicted · none · ref 21 · 2 links

A Unified and Reproducible Experimentation Framework for Speech Understanding
eess.AS · 2026-05-29 · unverdicted · none · ref 22

ULTRAS -- Unified Learning of Transformer Representations for Audio and Speech Signals
eess.AS · 2026-04-08 · unverdicted · none · ref 17

Joint Learning using Mixture-of-Expert-Based Representation for Speech Enhancement and Robust Emotion Recognition
eess.AS · 2025-09-10 · unverdicted · none · ref 50

AVEX: What Matters for Animal Vocalization Encoding
cs.SD · 2025-08-15 · unverdicted · none · ref 27

MOSS-Audio Technical Report
cs.SD · 2026-06-01 · unverdicted · none · ref 21

From Objectives to Applications: Aligning Architectural Biases in Audio Self-Supervised Learning
eess.AS · 2026-07-01 · unverdicted · none · ref 111

Multilingual and Multimodal LLMs in the Wild: Building for Low-Resource Languages
cs.CL · 2026-05-16 · unverdicted · none · ref 228
