archive

Every paper Pith has read. Search by title, abstract, or pith.

623 papers in eess.AS · page 3

eess.AS 2026-05-02 reviewed

Unified framework organizes 400 studies on speech AI bias
Toward Fair Speech Technologies: A Comprehensive Survey of Bias and Fairness in Speech AI

Yi-Cheng Lin +5
cs.AI 2026-05-01 reviewed

Clinician-reviewed AI creates personalized stuttering therapy plans
Virtual Speech Therapist: A Clinician-in-the-Loop AI Speech Therapy Agent for Personalized and Supervised Therapy

Shakeel Sheikh +6
cs.SD 2026-05-01 reviewed

Adversarial head erases script leakage from speaker embeddings
LASE: Language-Adversarial Speaker Encoding for Indic Cross-Script Identity Preservation

Venkata Pushpak Teja Menta
cs.SD 2026-05-01 reviewed

Filtered generative RIRs halve speaker distance errors
Towards Improving Speaker Distance Estimation through Generative Impulse Response Augmentation

Anton Ratnarajah +3
cs.CL 2026-05-01 reviewed

Encoding probe reconstructs LM internals from syntax and speaker cues
Beyond Decodability: Reconstructing Language Model Representations with an Encoding Probe

Gaofei Shen +4
eess.AS 2026-05-01 reviewed

Transformer generates ANC filters directly without decomposition
Transformer-based End-to-End Control Filter Generation for Active Noise Control

Ziyi Yang +5
cs.SD 2026-05-01 reviewed

Pretrained video-to-audio model estimates room acoustics
MMAudioReverbs: Video-Guided Acoustic Modeling for Dereverberation and Room Impulse Response Estimation

Akira Takahashi +3
cs.SD 2026-05-01 reviewed

One-step sampling matches multi-step audio quality at 8.5x speed
Fast Text-to-Audio Generation with One-Step Sampling via Energy-Scoring and Auxiliary Contextual Representation Distillation

Kuan-Po Huang +10
cs.SD 2026-04-30 reviewed

New pretraining creates encoder that spots voice deepfakes more reliably
Alethia: A Foundational Encoder for Voice Deepfakes

Yi Zhu +3
eess.AS 2026-04-30 reviewed

Pretrained embeddings classify elephant calls nearly as well as supervised models
From Birdsong to Rumbles: Classifying Elephant Calls with Out-of-Species Embeddings

Christiaan M. Geldenhuys +1
cs.LG 2026-04-30 reviewed

Multi-band fusion lifts bioacoustics accuracy over baseband
Beyond the Baseband: Adaptive Multi-Band Encoding for Full-Spectrum Bioacoustics Classification

Eklavya Sarkar +8
eess.AS 2026-04-30 reviewed

New benchmark makes AVSR considerably harder than LRS3
LRS-VoxMM: A benchmark for in-the-wild audio-visual speech recognition

Doyeop Kwak +3
eess.AS 2026-04-30 reviewed

Visual conditioning cuts WER by 16 points in overlapped conversations
BUT System Description for CHiME-9 MCoRec Challenge

Dominik Klement +4
eess.AS 2026-04-30 reviewed

Articulation knowledge improves speech extraction in movie audio
A Knowledge-Driven Approach to Target Speech Extraction in the Presence of Background Sound Effects for Cinematic Audio Source Separation (CASS)

Chun-wei Ho +3
cs.SD 2026-04-30 reviewed

Model predicts severe stuttering events from prior three seconds
Predicting Upcoming Stuttering Events from Three-Second Audio: Stratified Evaluation Reveals Severity-Selective Precursors, and the Model Deploys Fully On-Device

Nazar Kozak
eess.AS 2026-04-29 reviewed

Embedding emotion metrics fail for speech synthesis evaluation
The False Resonance: A Critical Examination of Emotion Embedding Similarity for Speech Generation Evaluation

Yun-Shao Tsai +7
eess.AS 2026-04-29 reviewed

Language branch in discriminator keeps speaker traits intact across languages
Dual-LoRA: Parameter-Efficient Adversarial Disentanglement for Cross-Lingual Speaker Verification

Qituan Shangguan +7
eess.AS 2026-04-29 reviewed

Semantic priors aid speech coding only below 6 kbps
SPG-Codec: Exploring the Role and Boundaries of Semantic Priors in Ultra-Low-Bitrate Neural Speech Coding

Mingyu Zhao +3
eess.AS 2026-04-29 reviewed

Diffusion model adds tunable prosody control to voice anonymization
DiffAnon: Diffusion-based Prosody Control for Voice Anonymization

Ismail Rasim Ulgen +4
cs.SD 2026-04-29 reviewed

Recurrence patterns in speech detect depression with AUC 0.689
Recurrence-Based Nonlinear Vocal Dynamics as Digital Biomarkers for Depression Detection from Conversational Speech

Himadri S Samanta
eess.AS 2026-04-28 reviewed

Synthetic data improves cross-lingual science voice cloning
One Voice, Many Tongues: Cross-Lingual Voice Cloning for Scientific Speech

Amanuel Gizachew Abebe +1
eess.AS 2026-04-28 reviewed

Cosine SupCon with delayed queue hits 8.29% ITW EER for deepfake audio
Similarity Choice and Negative Scaling in Supervised Contrastive Learning for Deepfake Audio Detection

Jaskirat Sudan +3
eess.AS 2026-04-28 reviewed

Human feedback restores naturalness to audio reasoning models
Step-Audio-R1.5 Technical Report

Yuxin Zhang +18
eess.AS 2026-04-28 reviewed

Fusion of noisy and enhanced speech aids speaker ID in noise
UNet-Based Fusion and Exponential Moving Average Adaptation for Noise-Robust Speaker Recognition

Chong-Xin Gan +6
eess.AS 2026-04-28 reviewed

Semantic uncertainty beats token-level for audio LLMs
Walking Through Uncertainty: An Empirical Study of Uncertainty Estimation for Audio-Aware Large Language Models

Chun-Yi Kuan +2
cs.SD 2026-04-28 reviewed

Frozen base TTS matches commercial Indic output via prompt recovery
Praxy Voice: Voice-Prompt Recovery + BUPS for Commercial-Class Indic TTS from a Frozen Non-Indic Base at Zero Commercial-Training-Data Cost

Venkata Pushpak Teja Menta
eess.AS 2026-04-28 reviewed

Azimuth-first strips cut DOA search cost for planar mics
ASAP: An Azimuth-Priority Strip-Based Search Approach to Planar Microphone Array DOA Estimation in 3D

Ming Huang +8
cs.SD 2026-04-28 reviewed

Speaker-adaptive network lifts conversation emotion accuracy
ML-SAN: Multi-Level Speaker-Adaptive Network for Emotion Recognition in Conversations

Kexue Wang +2
eess.AS 2026-04-28 reviewed

Rhythmic features distinguish Nyishi from Adi at 85 percent accuracy
Cross-Linguistic Rhythmic and Spectral Feature-Based Analysis of Nyishi and Adi: Two Under-Resourced Languages of Arunachal Pradesh

Deepshikha Gogoi +2
cs.CL 2026-04-28 reviewed

Aegyo speech raises first formant to mimic child vocal tracts
Korean aegyo speech shows systematic F1 increase to signal childlike qualities

Ji-eun Kim +1
cs.SD 2026-04-27 reviewed

Models keep 60-72% of audio scores with no sound input
All That Glitters Is Not Audio: Rethinking Text Priors and Audio Reliance in Audio-Language Evaluation

Leonardo Haw-Yang Foo +4
cs.SD 2026-04-27 reviewed

Segment-level prediction cuts oversegmentation in chord recognition
An event-based sequence modeling approach to recognizing non-triad chords with oversegmentation minimization

Leekyung Kim +1
cs.SD 2026-04-27 reviewed

One-step drifting field matches noisy speech to clean distributions
Speech Enhancement Based on Drifting Models

Liang Xu +4
cs.CV 2026-04-26 reviewed

Shared high-level tokens plus separate decoders improve talking audio-video
Talker-T2AV: Joint Talking Audio-Video Generation with Autoregressive Diffusion Modeling

Zhen Ye +10
cs.SD 2026-04-25 reviewed

Piano transcription pipeline separates neoclassical from historical composers by Zipf fit
An audio-to-analysis pipeline with certified transcription for information-theoretic profiling of the piano repertoire

Fred Jalbert-Desforges
eess.AS 2026-04-25 reviewed

Speaker recognition latent spaces form hierarchical semantic clusters
Explainable AI in Speaker Recognition -- Making Latent Representations Understandable

Yanze Xu +2
eess.AS 2026-04-25 reviewed

Neural predictor selects filters ahead for moving noise
Predictive Directional Selective Fixed-Filter Active Noise Control for Moving Sources via a Convolutional Recurrent Neural Network

Boxiang Wang +5
eess.AS 2026-04-24 reviewed

Fine-tuned Whisper keeps speaker IDs consistent across audio chunks
Prompting Whisper for Joint Speech Transcription and Diarization

Mariia Zamyrova +1
eess.AS 2026-04-24 reviewed

Diarization priors let LLMs handle multi-speaker ASR via dialogue queries
DM-ASR: Diarization-aware Multi-speaker ASR with Large Language Models

Li Li +5
cs.SD 2026-04-24 reviewed

Beat-guided transformer quantizes MIDI rhythms to scores
Transformer-Based Rhythm Quantization of Performance MIDI Using Beat Annotations

Maximilian Wachter +2
eess.AS 2026-04-24 reviewed

Hybrid DNN-search method improves audio effect estimation
Audio Effect Estimation with DNN-Based Prediction and Search Algorithm

Youichi Okita +1
eess.AS 2026-04-24 reviewed

Global timeline and tool reasoning sustain timing accuracy in long audio
Listening with Time: Precise Temporal Awareness for Long-Form Audio Understanding

Mingchen Shao +8
cs.CL 2026-04-24 reviewed

TTS-PRISM scores Mandarin speech on 12 perceptual axes
TTS-PRISM: A Perceptual Reasoning and Interpretable Speech Model for Fine-Grained Diagnosis

Xi Wang +10
eess.AS 2026-04-24 reviewed

One text-driven model generates speech, music, and sound effects
UniSonate: A Unified Model for Speech, Music, and Sound Effect Generation with Text Instructions

Chunyu Qiang +13
eess.AS 2026-04-24 reviewed

New fusion cuts Apollo speech errors by 1.1 percent
Advancing automatic speech recognition using feature fusion with self-supervised learning features: A case study on Fearless Steps Apollo corpus

Szu-Jui Chen +1
eess.AS 2026-04-24 reviewed

New speech model spots pronunciation errors without reference texts
Beyond Acoustic Sparsity and Linguistic Bias: A Prompt-Free Paradigm for Mispronunciation Detection and Diagnosis

Haopeng Geng +5
cs.SD 2026-04-23 reviewed

Cello portamento steepness declines as performance tempo increases
Spectrographic Portamento Gradient Analysis: A Quantitative Method for Historical Cello Recordings with Application to Beethoven's Piano and Cello Sonatas, 1930--2012

Ignasi Sole
eess.AS 2026-04-23 reviewed

Optical sensors record full key motion in historical instruments
PHOTON: Non-Invasive Optical Tracking of Key-Lever Motion in Historical Keyboard Instruments

Noah Jaffe +1