archive
Every paper Pith has read.
623 papers in eess.AS · page 3
-
Unified framework organizes 400 studies on speech AI bias
Toward Fair Speech Technologies: A Comprehensive Survey of Bias and Fairness in Speech AI
-
Clinician-reviewed AI creates personalized stuttering therapy plans
Virtual Speech Therapist: A Clinician-in-the-Loop AI Speech Therapy Agent for Personalized and Supervised Therapy
-
Adversarial head erases script leakage from speaker embeddings
LASE: Language-Adversarial Speaker Encoding for Indic Cross-Script Identity Preservation
-
Filtered generative RIRs halve speaker distance errors
Towards Improving Speaker Distance Estimation through Generative Impulse Response Augmentation
-
Encoding probe reconstructs LM internals from syntax and speaker cues
Beyond Decodability: Reconstructing Language Model Representations with an Encoding Probe
-
Transformer generates ANC filters directly without decomposition
Transformer-based End-to-End Control Filter Generation for Active Noise Control
-
Pretrained video-to-audio model estimates room acoustics
MMAudioReverbs: Video-Guided Acoustic Modeling for Dereverberation and Room Impulse Response Estimation
-
One-step sampling matches multi-step audio quality at 8.5x speed
Fast Text-to-Audio Generation with One-Step Sampling via Energy-Scoring and Auxiliary Contextual Representation Distillation
-
New pretraining creates encoder that spots voice deepfakes more reliably
Alethia: A Foundational Encoder for Voice Deepfakes
-
Pretrained embeddings classify elephant calls nearly as well as supervised models
From Birdsong to Rumbles: Classifying Elephant Calls with Out-of-Species Embeddings
-
Multi-band fusion lifts bioacoustics accuracy over baseband
Beyond the Baseband: Adaptive Multi-Band Encoding for Full-Spectrum Bioacoustics Classification
-
New benchmark makes AVSR considerably harder than LRS3
LRS-VoxMM: A benchmark for in-the-wild audio-visual speech recognition
-
Visual conditioning cuts WER by 16 points in overlapped conversations
BUT System Description for CHiME-9 MCoRec Challenge
-
Articulation knowledge improves speech extraction in movie audio
A Knowledge-Driven Approach to Target Speech Extraction in the Presence of Background Sound Effects for Cinematic Audio Source Separation (CASS)
-
Model predicts severe stuttering events from prior three seconds
Predicting Upcoming Stuttering Events from Three-Second Audio: Stratified Evaluation Reveals Severity-Selective Precursors, and the Model Deploys Fully On-Device
-
Embedding emotion metrics fail for speech synthesis evaluation
The False Resonance: A Critical Examination of Emotion Embedding Similarity for Speech Generation Evaluation
-
Language branch in discriminator keeps speaker traits intact across languages
Dual-LoRA: Parameter-Efficient Adversarial Disentanglement for Cross-Lingual Speaker Verification
-
Semantic priors aid speech coding only below 6 kbps
SPG-Codec: Exploring the Role and Boundaries of Semantic Priors in Ultra-Low-Bitrate Neural Speech Coding
-
Diffusion model adds tunable prosody control to voice anonymization
DiffAnon: Diffusion-based Prosody Control for Voice Anonymization
-
Recurrence patterns in speech detect depression with AUC 0.689
Recurrence-Based Nonlinear Vocal Dynamics as Digital Biomarkers for Depression Detection from Conversational Speech
-
Synthetic data improves cross-lingual science voice cloning
One Voice, Many Tongues: Cross-Lingual Voice Cloning for Scientific Speech
-
Cosine SupCon with delayed queue hits 8.29% ITW EER for deepfake audio
Similarity Choice and Negative Scaling in Supervised Contrastive Learning for Deepfake Audio Detection
-
Human feedback restores naturalness to audio reasoning models
Step-Audio-R1.5 Technical Report
-
Fusion of noisy and enhanced speech aids speaker ID in noise
UNet-Based Fusion and Exponential Moving Average Adaptation for Noise-Robust Speaker Recognition
-
Semantic uncertainty beats token-level for audio LLMs
Walking Through Uncertainty: An Empirical Study of Uncertainty Estimation for Audio-Aware Large Language Models
-
Frozen base TTS matches commercial Indic output via prompt recovery
Praxy Voice: Voice-Prompt Recovery + BUPS for Commercial-Class Indic TTS from a Frozen Non-Indic Base at Zero Commercial-Training-Data Cost
-
Azimuth-first strips cut DOA search cost for planar mics
ASAP: An Azimuth-Priority Strip-Based Search Approach to Planar Microphone Array DOA Estimation in 3D
-
Speaker-adaptive network lifts conversation emotion accuracy
ML-SAN: Multi-Level Speaker-Adaptive Network for Emotion Recognition in Conversations
-
Rhythmic features distinguish Nyishi from Adi at 85 percent accuracy
Cross-Linguistic Rhythmic and Spectral Feature-Based Analysis of Nyishi and Adi: Two Under-Resourced Languages of Arunachal Pradesh
-
Aegyo speech raises first formant to mimic child vocal tracts
Korean aegyo speech shows systematic F1 increase to signal childlike qualities
-
Models keep 60-72% of audio scores with no sound input
All That Glitters Is Not Audio: Rethinking Text Priors and Audio Reliance in Audio-Language Evaluation
-
Segment-level prediction cuts oversegmentation in chord recognition
An event-based sequence modeling approach to recognizing non-triad chords with oversegmentation minimization
-
One-step drifting field matches noisy speech to clean distributions
Speech Enhancement Based on Drifting Models
-
Shared high-level tokens plus separate decoders improve talking audio-video
Talker-T2AV: Joint Talking Audio-Video Generation with Autoregressive Diffusion Modeling
-
Piano transcription pipeline separates neoclassical from historical composers by Zipf fit
An audio-to-analysis pipeline with certified transcription for information-theoretic profiling of the piano repertoire
-
Speaker recognition latent spaces form hierarchical semantic clusters
Explainable AI in Speaker Recognition -- Making Latent Representations Understandable
-
Neural predictor selects filters ahead for moving noise
Predictive Directional Selective Fixed-Filter Active Noise Control for Moving Sources via a Convolutional Recurrent Neural Network
-
Fine-tuned Whisper keeps speaker IDs consistent across audio chunks
Prompting Whisper for Joint Speech Transcription and Diarization
-
Diarization priors let LLMs handle multi-speaker ASR via dialogue queries
DM-ASR: Diarization-aware Multi-speaker ASR with Large Language Models
-
Beat-guided transformer quantizes MIDI rhythms to scores
Transformer-Based Rhythm Quantization of Performance MIDI Using Beat Annotations
-
Hybrid DNN-search method improves audio effect estimation
Audio Effect Estimation with DNN-Based Prediction and Search Algorithm
-
Global timeline and tool reasoning sustain timing accuracy in long audio
Listening with Time: Precise Temporal Awareness for Long-Form Audio Understanding
-
TTS-PRISM scores Mandarin speech on 12 perceptual axes
TTS-PRISM: A Perceptual Reasoning and Interpretable Speech Model for Fine-Grained Diagnosis
-
One text-driven model generates speech, music, and sound effects
UniSonate: A Unified Model for Speech, Music, and Sound Effect Generation with Text Instructions
-
New fusion cuts Apollo speech errors by 1.1 percent
Advancing automatic speech recognition using feature fusion with self-supervised learning features: A case study on Fearless Steps Apollo corpus
-
New speech model spots pronunciation errors without reference texts
Beyond Acoustic Sparsity and Linguistic Bias: A Prompt-Free Paradigm for Mispronunciation Detection and Diagnosis
-
Cello portamento steepness declines as performance tempo increases
Spectrographic Portamento Gradient Analysis: A Quantitative Method for Historical Cello Recordings with Application to Beethoven's Piano and Cello Sonatas, 1930--2012
-
Optical sensors record full key motion in historical instruments
PHOTON: Non-Invasive Optical Tracking of Key-Lever Motion in Historical Keyboard Instruments