archive

Every paper Pith has read.

623 papers in eess.AS · page 1

eess.AS 2026-05-22 reviewed

EMA and dual scoring produce TTS hardest to detect in WildSpoof
Natural Yet Challenging to Detect: Robust In-the-Wild TTS through EMA and Dual-Scoring Prompt Selection -- Submission for WildSpoof 2026 TTS Track

Renhe Sun +4
eess.AS 2026-05-22 reviewed

Frame-aligned fusion of two encoders cuts error in hearing-aid intelligibility prediction
Frame-Aligned Fusion of Canary and WavLM for Non-Intrusive Intelligibility Prediction of Hearing-Aid-Processed Speech

Kazushi Nakazawa
eess.AS 2026-05-22 reviewed

Acoustic fusion raises intelligibility correlation to 0.806
Word-Level Modeling with Alignment-Aware Acoustic Fusion for Text-Assisted Intelligibility Prediction in Listeners with Hearing Loss

Kazushi Nakazawa
eess.AS 2026-05-22 reviewed

Two-stage training matches full phoneme scoring with few labels
A study on weakly-supervised training approaches for phoneme-level pronunciation scoring

Jazmín Vidal +1
eess.AS 2026-05-22 reviewed

One model tops benchmarks in speech recognition
StepAudio 2.5 Technical Report

Bin Lin +100
eess.AS 2026-05-22 reviewed

Integrated gradients localize sound events at 0.39 IoU
Evaluating the Temporal Detection Capability of Integrated Gradients Applied on Sound Classifier

Martynas Dumpis +1
eess.AS 2026-05-22 reviewed

One model judges speech across many tasks with reasoning
UniSRM: A Unified Speech Reward Model for Reasoning-Based Fine-grained Assessment

Yuanyuan Wang +6
cs.LG 2026-05-21 reviewed

Plug-in losses approximate EDL objectives with decaying error
Plug-in Losses for Evidential Deep Learning: A Simplified Framework for Uncertainty Estimation that Includes the Softmax Classifier

Berk Hayta +3
cs.AI 2026-05-21 reviewed

LLM analysis outperforms acoustics for political pathos
Beyond Acoustic Emotion Recognition: Multimodal Pathos Analysis in Political Speech Using LLM-Based and Acoustic Emotion Models

Juergen Dietrich
cs.SD 2026-05-21 reviewed

Audio denoiser infers scene to keep relevant sounds
Automatic Contextual Audio Denoising

Diep Luong +3
eess.AS 2026-05-21 reviewed

Dual-stage phoneme search raises user keyword spotting to 97.85% AUC
Effective User-defined Keyword Spotting with Dual-stage Matching, Multi-modal Enrollment, and Continual Adaptation

Zhiqi Ai +5
cs.SD 2026-05-21 reviewed

Augmentations reduce TTS word error rate from 1.44 to 1.38
RobustSpeechFlow: Learning Robust Text-to-Speech Trajectories via Augmentation-based Contrastive Flow Matching

Jinhyeok Yang +5
eess.AS 2026-05-21 reviewed

Neighbor consistency cuts sound zone variation by over 50 percent
Neighbor-Consistent Neural Filters for Robust Personal Sound Zones Under Localization Uncertainty

Hao Jiang +1
eess.AS 2026-05-20 reviewed

Embeddings cluster speech degradations for better detection
Speech Quality Embeddings for Improved Detection and Classification of Degradations in Speech Signals

Michael Kuhlmann +2
eess.AS 2026-05-20 reviewed

Neural beamformer outperforms LCMV by learning constrained weights
Linearly Constrained Deep Beamformer for Multi-Speaker Scenarios

Ilai Zaidel +3
eess.AS 2026-05-20 reviewed

Survey unifies audio reasoning approaches in foundation models
A Survey of Audio Reasoning in Multimodal Foundation Models

Zhihan Guo +10
eess.AS 2026-05-20 reviewed

Neural net predicts room acoustics from geometry and materials
From Numbers to Perception, Energy Decay Curves Prediction

Imran Muhammad +1
cs.SD 2026-05-20 reviewed

Tropical bird detector trained on 50k-clip dataset hits 99.57% accuracy
SEABAD: A Tropical Bird Activity Detection Dataset for Passive Acoustic Monitoring

Muhammad Mun'im Ahmad Zabidi +2
eess.AS 2026-05-20 reviewed

Public speech data powers TTS models matching closed systems
Raon-OpenTTS: Open Models and Data for Robust Text-to-Speech

Semin Kim +10
eess.AS 2026-05-20 reviewed

Full-duplex model speaks and acts on the same 160 ms clock
DuplexSLA: A Full-Duplex Spoken Language Model with Synchronized Speech, Language, and Action

Haoyang Zhang +15
eess.AS 2026-05-19 reviewed

Planning step and targeted retrieval stabilize accuracy on longer audio
PlanRAG-Audio: Planning and Retrieval Augmented Generation for Long-form Audio Understanding

Masao Someki +19
eess.AS 2026-05-19 reviewed

Causal estimator improves short-window sound field reconstruction
Causal Spatio-Temporal Sound Field Reconstruction

David Sundström +3
cs.SD 2026-05-19 reviewed

Scaled simulations cut speech recognition errors over 30 percent
Mega-ASR: Towards In-the-wild^2 Speech Recognition via Scaling up Real-world Acoustic Simulation

Zhifei Xie +6
eess.AS 2026-05-19 reviewed

Cross-talk reduction on close-talk mics yields SOTA far-field separation
Cross-Talk Speech Reduction, by Separation, for Separation

Zhong-Qiu Wang +1
eess.AS 2026-05-19 reviewed

Block-diagonal matrices cut computation for distributed audio separation
Fast Multichannel NMF with Block-Diagonal Spatial Covariance Matrices for Efficient Blind Source Separation Using Distributed Microphone Arrays

Hirotaka Nishikori +4
eess.AS 2026-05-18 reviewed

Geometry conditioning adapts speaker extraction to any mic array
Flexible Multi-Channel Target Speaker Extraction Using Geometry-Conditioned Spatially Selective Non-linear Filters

Jiatong Li +2
eess.AS 2026-05-18 reviewed

Streaming CTC spotting enables real-time keyword biasing in ASR
Contextual Biasing for Streaming ASR via CTC-based Word Spotting

Kai-Chen Tsai +3
eess.AS 2026-05-18 reviewed

Streaming CTC spotting reduces WER and lifts keyword F-score in live ASR
Contextual Biasing for Streaming ASR via CTC-based Word Spotting

Kai-Chen Tsai +3
eess.AS 2026-05-18 reviewed

TNKP cuts misadjustment in fractional subband filters for ANC
Fractional-Order Subband p-Norm Adaptive Filter via Transformation Nearest Kronecker Product Decomposition for Active Noise Control

Jianhong Ye +3
cs.MM 2026-05-18 reviewed

Two-phase sampling matches contradictory audio prompts to video
CounterFlow: A Two-Phase Inference-Time Sampling for Counterfactual Video Foley Generation

Gyubin Lee +2
eess.AS 2026-05-18 reviewed

156-hour Urdu corpus supplies 12 paralinguistic labels
UrduSpeech: A 156-Hour Urdu Speech Corpus with 12-Dimension Paralinguistic Annotations

Attia Nafees ul Haq +4
cs.CL 2026-05-18 reviewed

Distillation cuts error rates for Nigerian speech recognition by 29%
Sometin Beta Pass Notin (SBPN): Improving Multilingual ASR for Nigerian Languages via Knowledge Distillation

Sewade Ogun
eess.AS 2026-05-17 reviewed

Per-class unreliability scalars boost audio tagging on weak labels
Robust Audio Tagging under Class-wise Supervision Unreliability

Yuanbo Hou +6
eess.AS 2026-05-17 reviewed

Projection heads align onomatopoeic images with sounds
Audio-Image Cross-Modal Retrieval with Onomatopoeic Images

Keisuke Imoto +2
cs.CL 2026-05-17 reviewed

ASR errors degrade Korean QA the same relative amount across LLMs
Analyzing Error Propagation in Korean Spoken QA with ASR-LLM Cascades

Donghyuk Jung +1
eess.AS 2026-05-17 reviewed

402M model tops music accompaniment benchmarks
S2Accompanist: A Semantic-Aware and Structure-Guided Diffusion Model for Music Accompaniment Generation

Huakang Chen +9
eess.AS 2026-05-17 reviewed

A single control filter optimized over multiple measured paths narrows performance…
Robust Soft-Constrained Spatially Selective Active Noise Control for Hearables Under Secondary Path Variations

Tong Xiao +3
eess.AS 2026-05-17 reviewed

Audio models confuse target speech with multilingual distractors
Can Large Audio Language Models Ignore Multilingual Distractors? An Evaluation of Their Selective Auditory Attention Capabilities

Heejoon Koo
cs.SD 2026-05-16 reviewed

Target-KL regularization sets exact bitrates for audio VAEs
Taming Audio VAEs via Target-KL Regularization

Prem Seetharaman +1
eess.AS 2026-05-16 reviewed

Alignment step fixes semantic drift in continuous speech synthesis
SemaVoice: Semantic-Aware Continuous Autoregressive Speech Synthesis

Huimeng Wang +9
eess.AS 2026-05-15 reviewed

Survey traces audio super-resolution shift to generative models
A Survey of Advancing Audio Super-Resolution and Bandwidth Extension from Discriminative to Generative Models

Ningyuan Yang +6
eess.AS 2026-05-15 reviewed

MedASR cuts medical dictation errors by 58%
MedASR: An Open-Source Model for High-Accuracy Medical Dictation

Ke Wu +4
eess.AS 2026-05-15 reviewed

Flow model restores speech in real time at 120 times lower compute
Real-time Speech Restoration using Data Prediction Mean Flows

Sebastian Braun
eess.AS 2026-05-15 reviewed

Augmentation and LLM fixes halve errors in oral cancer speech recognition
Improving Automatic Speech Recognition for Speakers Treated for Oral Cancer using Data Augmentation and LLM Error Correction

Hidde Folkertsma +6
eess.AS 2026-05-14 reviewed

Synthetic data nears real baselines for multi-talker speech tasks
Mind the Gap: Impact of Synthetic Conversational Data on Multi-Talker ASR and Speaker Diarization

Alexander Polok +5
cs.SD 2026-05-14 reviewed

SpeakerLLM turns speaker verification into natural-language reasoning
SpeakerLLM: A Speaker-Specialized Audio-LLM for Speaker Understanding and Verification Reasoning

KiHyun Nam +4
eess.AS 2026-05-13 reviewed

Benchmark standardizes early Parkinson's speech detection
A Benchmark for Early-stage Parkinson's Disease Detection from Speech

Terry Yi Zhong +5

2 Piths
eess.AS 2026-05-13 reviewed

Framework filters FSD50K to single-source audio clips
FSD50K-Solo: Automated Curation of Single-Source Sound Events

Ningyuan Yang +6
eess.AS 2026-05-12 reviewed

SMC dataset exposes tempo bias in state-of-the-art beat tracking models
The SMC Blind Spot: A Failure Mode Analysis of State-of-the-Art Beat Tracking

Jaehoon Ahn +2
cs.SD 2026-05-12 reviewed

STRUM turns raw audio into playable rhythm charts at 0.84 F1 for drums
STRUM: A Spectral Transcription and Rhythm Understanding Model for End-to-End Generation of Playable Rhythm-Game Charts

Joshua Opria