archive

Every paper Pith has read.

623 papers in eess.AS · page 7

cs.CL 2026-01-31 reviewed

Single-layer tokenizer separates speaker identity from speech phonetics
Kanade: A Simple Disentangled Tokenizer for Spoken Language Modeling

Zhijie Huang +3
eess.AS 2026-01-30 reviewed

CALM halves biased errors in two-speaker ASR
CALM: Joint Contextual Acoustic-Linguistic Modeling for Personalization of Multi-Speaker ASR

Muhammad Shakeel +4
cs.CL 2026-01-29 reviewed

Open ASR models reach parity with proprietary APIs on 52 languages
Qwen3-ASR Technical Report

Xian Shi +12
cs.HC 2026-01-29 reviewed

Brief spatial sounds convey direction in XR
Evaluating Spatialized Auditory Cues for Rapid Attention Capture in XR

Yoonsang Kim +2
eess.AS 2026-01-28 reviewed

Learnable projector cuts prompt sensitivity in LLM speech recognition
Reducing Prompt Sensitivity in LLM-based Speech Recognition Through Learnable Projection

Sergio Burdisso +9
cs.SD 2026-01-28 reviewed

Selecting the longest utterances halves speech pre-training data
A Study of Data Selection Strategies for Pre-training Self-Supervised Speech Models

Ryan Whetten +3
eess.AS 2026-01-27 reviewed

Detector catches deepfake greetings in 0.5 seconds
Audio Deepfake Detection at the First Greeting: "Hi!"

Haohan Shi +4
eess.AS 2026-01-26 reviewed

Noise rejection lifts heart-sound CAD detection by 4 points
Noise-Robust Contrastive Learning with an MFCC-Conformer For Coronary Artery Disease Detection

Milan Marocchi +2
eess.AS 2026-01-26 reviewed

One model covers speech, expressive, and singing voice conversion
OneVoice: One Model, Triple Scenarios-Towards Unified Zero-shot Voice Conversion

Zhichao Wang +5
cs.SD 2026-01-22 reviewed

SWIM scales real-time ASR to 20 clients via buffer merging
Sink or SWIM: Tackling Real-Time ASR at Scale

Federico Bruzzone +3
eess.AS 2026-01-22 reviewed

Hybrid algorithm gives fast noise control with low error and high stability
A Stabilized Hybrid Active Noise Control Algorithm of GFANC and FxNLMS with Online Clustering

Zhengding Luo +5
cs.SD 2026-01-22 reviewed

Qwen3-TTS reaches SOTA multilingual TTS with 3-second cloning
Qwen3-TTS Technical Report

Hangrui Hu +15
eess.AS 2026-01-21 reviewed

Fast-ULCNet halves model size and cuts latency 34% for speech enhancement
Fast-ULCNet: A fast and ultra low complexity network for single-channel speech enhancement

Nicolás Arrieta Larraza +1
eess.AS 2026-01-21 reviewed

Mask polarization restores decisive outputs for speech enhancement at test time
Test-Time Adaptation For Speech Enhancement Via Mask Polarization

Tobias Raichle +2
eess.AS 2026-01-21 reviewed

Curvature-guided merge cuts forgetting in ASR continual learning
Inverse-Hessian Regularization for Continual Learning in ASR

Steven Vander Eeckt +1
eess.AS 2026-01-18 reviewed

Audio QA models fail to recognize when questions have no answer
AQUA-Bench: Beyond Finding Answers to Knowing When There Are None in Audio Question Answering

Chun-Yi Kuan +1
cs.SD 2026-01-14 reviewed

Self-reflection step improves speech recognition, cutting WER by 12.1%
Speech-Hands: A Self-Reflection Voice Agentic Approach to Speech Recognition and Audio Reasoning with Omni Perception

Zhen Wan +17
eess.AS 2026-01-09 reviewed

Hybrid model improves quality and consistency in speaker extraction
Discriminative-Generative Target Speaker Extraction with Decoder-Only Language Models

Bang Zeng +3
cs.CL 2026-01-09 reviewed

RL alignment closes speech-text reasoning gap in LLMs
Closing the Modality Reasoning Gap for Speech Large Language Models

Chaoren Wang +6
eess.AS 2026-01-08 reviewed

Low-frequency loss weighting solves delay learning in modulation-effect models
Gradient-based Optimisation of Modulation Effects

Alistair Carson +2
eess.AS 2026-01-07 reviewed

Encoder tracks speakers and timing together in one pass
TellWhisper: Tell Whisper Who Speaks When

Yifan Hu +4
eess.AS 2026-01-07 reviewed

ReStyle-TTS enables continuous relative style control in zero-shot TTS
ReStyle-TTS: Relative and Continuous Style Control for Zero-Shot Speech Synthesis

Haitao Li +5
cs.LG 2026-01-07 reviewed

Smart Embedding halves parameters in polyphonic music models
Mathematical Foundations of Polyphonic Music Generation via Structural Inductive Bias

Joonwon Seo
eess.AS 2026-01-06 reviewed

Fine-grained captions train multi-granular speech-text model
Towards Fine-Grained and Multi-Granular Contrastive Language-Speech Pre-training

Yifan Yang +10
cs.SD 2026-01-06 reviewed

Semantic neighbors fix prompt tuning for audio models
Generalizable Prompt Tuning for Audio-Language Models via Semantic Expansion

Jaehyuk Jang +3
eess.AS 2026-01-06 reviewed

Hybrid Mamba-Attention backbone matches SOTA on audio deepfake detection
XLSR-MamBo: Scaling the Hybrid Mamba-Attention Backbone for Audio Deepfake Detection

Kwok-Ho Ng +3
cs.SD 2026-01-05 reviewed

Per-layer compensation lowers word errors in low-bit ASR models
Diagnostic-Driven Layer-Wise Compensation for Post-Training Quantization of Encoder-Decoder ASR Models

Xinyu Wang +8
cs.SD 2025-12-23 reviewed

Anti-aliasing modules improve neural music and singing audio quality
Aliasing-Free Neural Audio Synthesis

Yicheng Gu +5
eess.AS 2025-12-18 reviewed

Noise modeling gives accurate FDN filters from noisy impulse responses
Learning Filters in Feedback Delay Networks from Noisy Room Impulse Responses

Gloria Dal Santo +3
eess.AS 2025-12-15 reviewed

Reserve retraining blocks poisoning in federated audio models
REVERB-FL: Server-Side Adversarial and Reserve-Enhanced Federated Learning for Robust Audio Classification

Sathwika Peechara +1
eess.AS 2025-11-29 reviewed

Decoders adapt to degradation while encoders stay invariant
Where Does Speech Enhancement Adapt? Probing Study Under Controlled Degradation

Yair Amar +2
eess.AS 2025-11-26 reviewed

Orchestral dataset supplies isolated stems for source separation
The Spheres Dataset: Multitrack Orchestral Recordings for Music Source Separation and Information Retrieval

Jaime Garcia-Martinez +7
eess.AS 2025-11-25 reviewed

Music language model fixes vocal pitch without references
BERT-APC: A Reference-free Framework for Automatic Pitch Correction via Musical Context Inference

Sungjae Kim +3
cs.CL 2025-11-13 reviewed

Benchmark shows speech models falter over repeated conversation turns
MTR-DuplexBench: Towards a Comprehensive Evaluation of Multi-Round Conversations for Full-Duplex Speech Language Models

He Zhang +7
eess.AS 2025-11-11 reviewed

Dynamic int8 cuts Whisper-small size 57% while improving accuracy
Quantizing Whisper-small: How design choices affect ASR performance

Arthur Söhler +2
eess.AS 2025-10-29 reviewed

Older adults match or beat young listeners with simulated hearing loss
Disentangling peripheral hearing loss from central and cognitive effects on speech intelligibility in older adults

Toshio Irino +2
cs.SD 2025-10-28 reviewed

Sound localization maps tool actions onto 3D surgical scenes
Sound Source Localization for Spatial Mapping of Surgical Actions in Dynamic Scenes

Jonas Hein +5
cs.SD 2025-10-28 reviewed

EMG signals map to speech model space for direct audio synthesis
emg2speech: Synthesizing speech from electromyography using self-supervised speech models

Harshavardhana T. Gowda +2
cs.CL 2025-10-22 reviewed

MBR decoding beats beam search on ASR accuracy
Re-evaluating Minimum Bayes Risk Decoding for Automatic Speech Recognition

Yuu Jinnai
eess.AS 2025-10-22 reviewed

Replay-inclusive dataset lifts deepfake detector accuracy
EchoFake: A Replay-Aware Dataset for Practical Speech Deepfake Detection

Tong Zhang +2
cs.LG 2025-10-21 reviewed

RFM steering raises music note accuracy from 0.23 to 0.82
Steering Autoregressive Music Generation with Recursive Feature Machines

Daniel Zhao +4
cs.AI 2025-10-19 reviewed

New model handles listening, looking, speaking and acting together
End-to-end Listen, Look, Speak and Act

Siyin Wang +6
cs.SD 2025-10-16 reviewed

LLM judges rate speech quality with explanations across languages
SpeechLLM-as-Judges: Towards General and Interpretable Speech Quality Evaluation

Hui Wang +11
cs.SD 2025-10-13 reviewed

Interleaved tokens unify speech and gesture synthesis
Gelina: Unified Speech and Gesture Synthesis via Interleaved Token Prediction

Téo Guichoux +7
cs.LG 2025-10-13 reviewed

Coefficient search in latent subspace adapts models with 63x less compute
Efficient Test-Time Adaptation through Latent Subspace Coefficients Search

Xinyu Luo +6
cs.SD 2025-10-10 reviewed

Fine-tuned video-to-audio model separates sounds while keeping generation ability
MMAudioSep: Taming Video-to-Audio Generative Model Towards Video/Text-Queried Sound Separation

Akira Takahashi +2
cs.SD 2025-10-10 reviewed

Progressive diffusion adds timing control and intelligibility to text-to-audio generation
ControlAudio: Tackling Text-Guided, Timing-Indicated and Intelligible Audio Generation via Progressive Diffusion Modeling

Yuxuan Jiang +5
eess.AS 2025-10-09 reviewed

Model subtraction fixes pseudo-label errors in speech AI
Pseudo2Real: Task Arithmetic for Pseudo-Label Correction in Automatic Speech Recognition

Yi-Cheng Lin +6
eess.AS 2025-10-09 reviewed

Tests reveal full-duplex systems get confused by overlaps and corrections
Full-Duplex-Bench-v2: A Multi-Turn Evaluation Framework for Duplex Dialogue Systems with an Automated Examiner

Guan-Ting Lin +6
eess.AS 2025-10-08 reviewed

VAPO stops AI from reading slides instead of listening
VAPO: End-to-end Slide-Enhanced Speech Recognition with Omni-modal Large Language Models

Rui Hu +4