archive
Every paper Pith has read.
623 papers in eess.AS
-
EMA and dual scoring produce TTS hardest to detect in WildSpoof
Natural Yet Challenging to Detect: Robust In-the-Wild TTS through EMA and Dual-Scoring Prompt Selection -- Submission for WildSpoof 2026 TTS Track
-
Frame-aligned fusion of two encoders cuts error in hearing-aid intelligibility prediction
Frame-Aligned Fusion of Canary and WavLM for Non-Intrusive Intelligibility Prediction of Hearing-Aid-Processed Speech
-
Acoustic fusion raises intelligibility correlation to 0.806
Word-Level Modeling with Alignment-Aware Acoustic Fusion for Text-Assisted Intelligibility Prediction in Listeners with Hearing Loss
-
Two-stage training matches full phoneme scoring with few labels
A study on weakly-supervised training approaches for phoneme-level pronunciation scoring
-
One model tops benchmarks in speech recognition
StepAudio 2.5 Technical Report
-
Integrated gradients localize sound events at 0.39 IoU
Evaluating the Temporal Detection Capability of Integrated Gradients Applied on Sound Classifier
-
One model judges speech across many tasks with reasoning
UniSRM: A Unified Speech Reward Model for Reasoning-Based Fine-grained Assessment
-
Plug-in losses approximate EDL objectives with decaying error
Plug-in Losses for Evidential Deep Learning: A Simplified Framework for Uncertainty Estimation that Includes the Softmax Classifier
-
LLM analysis outperforms acoustics for political pathos
Beyond Acoustic Emotion Recognition: Multimodal Pathos Analysis in Political Speech Using LLM-Based and Acoustic Emotion Models
-
Audio denoiser infers scene to keep relevant sounds
Automatic Contextual Audio Denoising
-
Dual-stage phoneme search raises user keyword spotting to 97.85% AUC
Effective User-defined Keyword Spotting with Dual-stage Matching, Multi-modal Enrollment, and Continual Adaptation
-
Augmentations reduce TTS word error rate from 1.44 to 1.38
RobustSpeechFlow: Learning Robust Text-to-Speech Trajectories via Augmentation-based Contrastive Flow Matching
-
Neighbor consistency cuts sound zone variation by over 50 percent
Neighbor-Consistent Neural Filters for Robust Personal Sound Zones Under Localization Uncertainty
-
Embeddings cluster speech degradations for better detection
Speech Quality Embeddings for Improved Detection and Classification of Degradations in Speech Signals
-
Neural beamformer outperforms LCMV by learning constrained weights
Linearly Constrained Deep Beamformer for Multi-Speaker Scenarios
-
Survey unifies audio reasoning approaches in foundation models
A Survey of Audio Reasoning in Multimodal Foundation Models
-
Neural net predicts room acoustics from geometry and materials
From Numbers to Perception, Energy Decay Curves Prediction
-
Tropical bird detector trained on 50k-clip dataset hits 99.57% accuracy
SEABAD: A Tropical Bird Activity Detection Dataset for Passive Acoustic Monitoring
-
Public speech data powers TTS models matching closed systems
Raon-OpenTTS: Open Models and Data for Robust Text-to-Speech
-
Full-duplex model speaks and acts on the same 160 ms clock
DuplexSLA: A Full-Duplex Spoken Language Model with Synchronized Speech, Language, and Action
-
Planning step and targeted retrieval stabilize accuracy on longer audio
PlanRAG-Audio: Planning and Retrieval Augmented Generation for Long-form Audio Understanding
-
Causal estimator improves short-window sound field reconstruction
Causal Spatio-Temporal Sound Field Reconstruction
-
Scaled simulations cut speech recognition errors over 30 percent
Mega-ASR: Towards In-the-wild^2 Speech Recognition via Scaling up Real-world Acoustic Simulation
-
Cross-talk reduction on close-talk mics yields SOTA far-field separation
Cross-Talk Speech Reduction, by Separation, for Separation
-
Block-diagonal matrices cut computation for distributed audio separation
Fast Multichannel NMF with Block-Diagonal Spatial Covariance Matrices for Efficient Blind Source Separation Using Distributed Microphone Arrays
-
Geometry conditioning adapts speaker extraction to any mic array
Flexible Multi-Channel Target Speaker Extraction Using Geometry-Conditioned Spatially Selective Non-linear Filters
-
Streaming CTC spotting enables real-time keyword biasing in ASR
Contextual Biasing for Streaming ASR via CTC-based Word Spotting
-
Streaming CTC spotting reduces WER and lifts keyword F-score in live ASR
Contextual Biasing for Streaming ASR via CTC-based Word Spotting
-
TNKP cuts misadjustment in fractional subband filters for ANC
Fractional-Order Subband p-Norm Adaptive Filter via Transformation Nearest Kronecker Product Decomposition for Active Noise Control
-
Two-phase sampling matches contradictory audio prompts to video
CounterFlow: A Two-Phase Inference-Time Sampling for Counterfactual Video Foley Generation
-
156-hour Urdu corpus supplies 12 paralinguistic labels
UrduSpeech: A 156-Hour Urdu Speech Corpus with 12-Dimension Paralinguistic Annotations
-
Distillation cuts error rates for Nigerian speech recognition by 29%
Sometin Beta Pass Notin (SBPN): Improving Multilingual ASR for Nigerian Languages via Knowledge Distillation
-
Per-class unreliability scalars boost audio tagging on weak labels
Robust Audio Tagging under Class-wise Supervision Unreliability
-
Projection heads align onomatopoeic images with sounds
Audio-Image Cross-Modal Retrieval with Onomatopoeic Images
-
ASR errors degrade Korean QA the same relative amount across LLMs
Analyzing Error Propagation in Korean Spoken QA with ASR-LLM Cascades
-
402M model tops music accompaniment benchmarks
S2Accompanist: A Semantic-Aware and Structure-Guided Diffusion Model for Music Accompaniment Generation
-
A single control filter optimized over multiple measured paths narrows performance…
Robust Soft-Constrained Spatially Selective Active Noise Control for Hearables Under Secondary Path Variations
-
Audio models confuse target speech with multilingual distractors
Can Large Audio Language Models Ignore Multilingual Distractors? An Evaluation of Their Selective Auditory Attention Capabilities
-
Target-KL regularization sets exact bitrates for audio VAEs
Taming Audio VAEs via Target-KL Regularization
-
Alignment step fixes semantic drift in continuous speech synthesis
SemaVoice: Semantic-Aware Continuous Autoregressive Speech Synthesis
-
Survey traces audio super-resolution shift to generative models
A Survey of Advancing Audio Super-Resolution and Bandwidth Extension from Discriminative to Generative Models
-
MedASR cuts medical dictation errors by 58%
MedASR: An Open-Source Model for High-Accuracy Medical Dictation
-
Flow model restores speech in real time at 120 times lower compute
Real-time Speech Restoration using Data Prediction Mean Flows
-
Augmentation and LLM fixes halve errors in oral cancer speech recognition
Improving Automatic Speech Recognition for Speakers Treated for Oral Cancer using Data Augmentation and LLM Error Correction
-
Synthetic data nears real baselines for multi-talker speech tasks
Mind the Gap: Impact of Synthetic Conversational Data on Multi-Talker ASR and Speaker Diarization
-
SpeakerLLM turns speaker verification into natural-language reasoning
SpeakerLLM: A Speaker-Specialized Audio-LLM for Speaker Understanding and Verification Reasoning
-
Benchmark standardizes early Parkinson's speech detection
A Benchmark for Early-stage Parkinson's Disease Detection from Speech
-
Framework filters FSD50K to single-source audio clips
FSD50K-Solo: Automated Curation of Single-Source Sound Events
-
SMC dataset exposes tempo bias in state-of-the-art beat tracking models
The SMC Blind Spot: A Failure Mode Analysis of State-of-the-Art Beat Tracking
-
STRUM turns raw audio into playable rhythm charts at 0.84 F1 for drums
STRUM: A Spectral Transcription and Rhythm Understanding Model for End-to-End Generation of Playable Rhythm-Game Charts