archive
Every paper Pith has read.
623 papers in eess.AS · page 7
-
Single-layer tokenizer separates speaker identity from speech phonetics
Kanade: A Simple Disentangled Tokenizer for Spoken Language Modeling
-
CALM halves biased errors in two-speaker ASR
CALM: Joint Contextual Acoustic-Linguistic Modeling for Personalization of Multi-Speaker ASR
-
Open ASR models reach parity with proprietary APIs on 52 languages
Qwen3-ASR Technical Report
-
Brief spatial sounds convey direction in XR
Evaluating Spatialized Auditory Cues for Rapid Attention Capture in XR
-
Learnable projector cuts prompt sensitivity in LLM speech recognition
Reducing Prompt Sensitivity in LLM-based Speech Recognition Through Learnable Projection
-
Keeping only the longest utterances halves speech pre-training data
A Study of Data Selection Strategies for Pre-training Self-Supervised Speech Models
-
Detector catches deepfake greetings in 0.5 seconds
Audio Deepfake Detection at the First Greeting: "Hi!"
-
Noise rejection lifts heart-sound CAD detection by 4 points
Noise-Robust Contrastive Learning with an MFCC-Conformer For Coronary Artery Disease Detection
-
One model covers speech, expressive, and singing voice conversion
OneVoice: One Model, Triple Scenarios-Towards Unified Zero-shot Voice Conversion
-
SWIM scales real-time ASR to 20 clients via buffer merging
Sink or SWIM: Tackling Real-Time ASR at Scale
-
Hybrid algorithm gives fast noise control with low error and high stability
A Stabilized Hybrid Active Noise Control Algorithm of GFANC and FxNLMS with Online Clustering
-
Qwen3-TTS reaches SOTA multilingual TTS with 3-second cloning
Qwen3-TTS Technical Report
-
Fast-ULCNet halves model size and cuts latency 34% for speech enhancement
Fast-ULCNet: A fast and ultra low complexity network for single-channel speech enhancement
-
Mask polarization restores decisive outputs for speech enhancement at test time
Test-Time Adaptation For Speech Enhancement Via Mask Polarization
-
Curvature-guided merge cuts forgetting in ASR continual learning
Inverse-Hessian Regularization for Continual Learning in ASR
-
Audio QA models miss when questions have no answer
AQUA-Bench: Beyond Finding Answers to Knowing When There Are None in Audio Question Answering
-
Self-reflection step cuts speech recognition WER by 12.1%
Speech-Hands: A Self-Reflection Voice Agentic Approach to Speech Recognition and Audio Reasoning with Omni Perception
-
Hybrid model improves quality and consistency in speaker extraction
Discriminative-Generative Target Speaker Extraction with Decoder-Only Language Models
-
RL alignment closes speech-text reasoning gap in LLMs
Closing the Modality Reasoning Gap for Speech Large Language Models
-
Low-frequency loss weighting solves delay learning in effect models
Gradient-based Optimisation of Modulation Effects
-
Encoder tracks speakers and timing together in one pass
TellWhisper: Tell Whisper Who Speaks When
-
ReStyle-TTS enables continuous relative style control in zero-shot TTS
ReStyle-TTS: Relative and Continuous Style Control for Zero-Shot Speech Synthesis
-
Smart Embedding halves parameters in polyphonic music models
Mathematical Foundations of Polyphonic Music Generation via Structural Inductive Bias
-
Fine-grained captions train multi-granular speech-text model
Towards Fine-Grained and Multi-Granular Contrastive Language-Speech Pre-training
-
Semantic neighbors fix prompt tuning for audio models
Generalizable Prompt Tuning for Audio-Language Models via Semantic Expansion
-
Hybrid Mamba-Attention backbone matches SOTA on audio deepfake detection
XLSR-MamBo: Scaling the Hybrid Mamba-Attention Backbone for Audio Deepfake Detection
-
Per-layer compensation lowers word errors in low-bit ASR models
Diagnostic-Driven Layer-Wise Compensation for Post-Training Quantization of Encoder-Decoder ASR Models
-
Anti-aliasing modules improve neural music and singing audio quality
Aliasing-Free Neural Audio Synthesis
-
Noise modeling gives accurate FDN filters from noisy impulse responses
Learning Filters in Feedback Delay Networks from Noisy Room Impulse Responses
-
Reserve retraining blocks poisoning in federated audio models
REVERB-FL: Server-Side Adversarial and Reserve-Enhanced Federated Learning for Robust Audio Classification
-
Decoders adapt to degradation while encoders stay invariant
Where Does Speech Enhancement Adapt? Probing Study Under Controlled Degradation
-
Orchestral dataset supplies isolated stems for source separation
The Spheres Dataset: Multitrack Orchestral Recordings for Music Source Separation and Information Retrieval
-
Music language model fixes vocal pitch without references
BERT-APC: A Reference-free Framework for Automatic Pitch Correction via Musical Context Inference
-
Benchmark shows speech models falter over repeated conversation turns
MTR-DuplexBench: Towards a Comprehensive Evaluation of Multi-Round Conversations for Full-Duplex Speech Language Models
-
Dynamic int8 cuts Whisper-small size 57% while improving accuracy
Quantizing Whisper-small: How design choices affect ASR performance
-
Older adults match or beat young listeners with simulated hearing loss
Disentangling peripheral hearing loss from central and cognitive effects on speech intelligibility in older adults
-
Sound localization maps tool actions onto 3D surgical scenes
Sound Source Localization for Spatial Mapping of Surgical Actions in Dynamic Scenes
-
EMG signals map to speech model space for direct audio synthesis
emg2speech: Synthesizing speech from electromyography using self-supervised speech models
-
MBR decoding beats beam search on ASR accuracy
Re-evaluating Minimum Bayes Risk Decoding for Automatic Speech Recognition
-
Replay-inclusive dataset lifts deepfake detector accuracy
EchoFake: A Replay-Aware Dataset for Practical Speech Deepfake Detection
-
RFM steering raises music note accuracy from 0.23 to 0.82
Steering Autoregressive Music Generation with Recursive Feature Machines
-
New model handles listening, looking, speaking and acting together
End-to-end Listen, Look, Speak and Act
-
LLM judges rate speech quality with explanations across languages
SpeechLLM-as-Judges: Towards General and Interpretable Speech Quality Evaluation
-
Interleaved tokens unify speech and gesture synthesis
Gelina: Unified Speech and Gesture Synthesis via Interleaved Token Prediction
-
Coefficient search in latent subspace adapts models with 63x less compute
Efficient Test-Time Adaptation through Latent Subspace Coefficients Search
-
Fine-tuned video-to-audio model separates sounds while keeping generation ability
MMAudioSep: Taming Video-to-Audio Generative Model Towards Video/Text-Queried Sound Separation
-
Progressive diffusion adds timing control and intelligibility to text-to-audio generation
ControlAudio: Tackling Text-Guided, Timing-Indicated and Intelligible Audio Generation via Progressive Diffusion Modeling
-
Model subtraction fixes pseudo-label errors in speech AI
Pseudo2Real: Task Arithmetic for Pseudo-Label Correction in Automatic Speech Recognition
-
Tests reveal full-duplex systems get confused by overlaps and corrections
Full-Duplex-Bench-v2: A Multi-Turn Evaluation Framework for Duplex Dialogue Systems with an Automated Examiner
-
VAPO stops AI from reading slides instead of listening
VAPO: End-to-end Slide-Enhanced Speech Recognition with Omni-modal Large Language Models