APEX generates four types of prototype-based explanations for pre-trained audio classifiers that preserve output invariance and target acoustic properties better than gradient methods applied to spectrograms.
Hubert: Self-supervised speech represen- tation learning by masked prediction of hidden units
8 Pith papers cite this work. Polarity classification is still indexing.
verdicts
UNVERDICTED 8representative citing papers
SDTalk proposes a generalizable one-shot 3DGS talking head method that uses structured facial priors for complete reconstruction and dual-branch motion fields for dynamics, outperforming prior identity-specific approaches.
A Conformer-conditioned decoder-only language model generates discrete tokens via a neural audio codec to separate four music stems, reaching near state-of-the-art perceptual quality and top NISQA on vocals in MUSDB18-HQ tests.
Acoustic scattering signals fed into fine-tuned self-supervised deep learning models classify hair type and moisture at nearly 90% accuracy as a non-invasive alternative to visual methods.
Fine-tuned speech representation models with hierarchical classification outperform multimodal LLMs on pediatric speech sound disorder tasks.
Mixed batching with only 10% target-domain speech achieves word error rates matching or exceeding conventional full-dataset ASR fine-tuning in LLM-based models.
An adaptive cross-modal gating network improves depression detection from speech by selectively weighting sparse relevant segments across acoustic and textual modalities.
Cross-lifespan evaluation shows adult-trained speech foundation models degrade on child and older-adult data, with joint multi-age training and targeted adaptation improving robustness especially using Whisper encoder.
citing papers explorer
-
APEX: Audio Prototype EXplanations for Classification Tasks
APEX generates four types of prototype-based explanations for pre-trained audio classifiers that preserve output invariance and target acoustic properties better than gradient methods applied to spectrograms.
-
SDTalk: Structured Facial Priors and Dual-Branch Motion Fields for Generalizable Gaussian Talking Head Synthesis
SDTalk proposes a generalizable one-shot 3DGS talking head method that uses structured facial priors for complete reconstruction and dual-branch motion fields for dynamics, outperforming prior identity-specific approaches.
-
Discrete Token Modeling for Multi-Stem Music Source Separation with Language Models
A Conformer-conditioned decoder-only language model generates discrete tokens via a neural audio codec to separate four music stems, reaching near state-of-the-art perceptual quality and top NISQA on vocals in MUSDB18-HQ tests.
-
Acoustic scattering AI for non-invasive object classifications: A case study on hair assessment
Acoustic scattering signals fed into fine-tuned self-supervised deep learning models classify hair type and moisture at nearly 90% accuracy as a non-invasive alternative to visual methods.
-
Multimodal LLMs are not all you need for Pediatric Speech Language Pathology
Fine-tuned speech representation models with hierarchical classification outperform multimodal LLMs on pediatric speech sound disorder tasks.
-
Closing the Speech-Text Gap with Limited Audio for Effective Domain Adaptation in LLM-Based ASR
Mixed batching with only 10% target-domain speech achieves word error rates matching or exceeding conventional full-dataset ASR fine-tuning in LLM-based models.
-
Learning to Attend to Depression-Related Patterns: An Adaptive Cross-Modal Gating Network for Depression Detection
An adaptive cross-modal gating network improves depression detection from speech by selectively weighting sparse relevant segments across acoustic and textual modalities.
-
Exploring Speech Foundation Models for Speaker Diarization Across Lifespan
Cross-lifespan evaluation shows adult-trained speech foundation models degrade on child and older-adult data, with joint multi-age training and targeted adaptation improving robustness especially using Whisper encoder.