archive

Every paper Pith has read.

623 papers in eess.AS · page 4

cs.LG 2026-04-23 reviewed

One dilated CNN plus resampling matches AR denoising for periodic signals
Dilated CNNs for Periodic Signal Processing: A Low-Complexity Approach

Eli Gildish +2
eess.AS 2026-04-23 reviewed

Tutorial splits top open-source speaker diarization into seven stages
DiariZen Explained: A Tutorial for the Open Source State-of-the-Art Speaker Diarization Pipeline

Nikhil Raghav
eess.AS 2026-04-23 reviewed

New benchmark tests AI on handling speech overlaps and interruptions
Full-Duplex Interaction in Spoken Dialogue Systems: A Comprehensive Study from the ICASSP 2026 HumDial Challenge

Chengyou Wang +8
cs.SD 2026-04-22 reviewed

Benchmark reveals AI music models perceive notation but miss theory
ONOTE: Benchmarking Omnimodal Notation Processing for Expert-level Music Intelligence

Menghe Ma +7
eess.AS 2026-04-22 reviewed

MERT metrics better match human ratings for music source separation
Embedding-Based Intrusive Evaluation Metrics for Musical Source Separation Using MERT Representations

Paul A. Bereuter +1
q-bio.NC 2026-04-22 reviewed

Decorrelation reduces brain-to-text WER from 26.3% to 21.6%
MoDAl: Self-Supervised Neural Modality Discovery via Decorrelation for Speech Neuroprosthesis

Yuanhao Chen +1
math.CO 2026-04-21 reviewed

Diatonic seventh chords form a Fano configuration
Tonnetz Theory, Classical Harmony, and the Combinatorial Geometry of Abstract Musical Resources

Jeffrey R. Boland +1
eess.AS 2026-04-21 reviewed

Hyperbolic fusion spots Indic codec deepfakes
Indic-CodecFake meets SATYAM: Towards Detecting Neural Audio Codec Synthesized Speech Deepfakes in Indic Languages

Girish +3
eess.AS 2026-04-21 reviewed

Cascaded temporal stages yield natural TTS with fewer parameters
Text-To-Speech with Chain-of-Details: modeling temporal dynamics in speech generation

Jianbo Ma +1
eess.AS 2026-04-21 reviewed

Voice range tracks TTS capability while CPPs separate natural from robotic speech
Voice Mapping of Text-to-Speech Systems: A Metric-Based Approach for Voice Quality Assessment

Huanchen Cai +1
cs.AI 2026-04-21 reviewed

One LLM replaces VAD, ASR and interruption detection for live speech
UAF: A Unified Audio Front-end LLM for Full-Duplex Speech Interaction

Yadong Li +3
cs.CL 2026-04-21 reviewed

Unscripted phone calls form new benchmark for Indian speech recognition
Voice of India: A Large-Scale Benchmark for Real-World Speech Recognition in India

Kaushal Bhogale +13
eess.AS 2026-04-21 reviewed

Consistency regularization unifies offline and streaming RNNT ASR
Reducing the Offline-Streaming Gap for Unified ASR Transducer with Consistency Regularization

Andrei Andrusenko +5
eess.AS 2026-04-21 reviewed

Photoelectric servo cuts mic self-noise to 11 dBA
Self-Noise Reduction for Capacitive Sensors via Photoelectric DC Servo: Application to Condenser Microphones

Hirotaka Obo +3
eess.SP 2026-04-20 reviewed

Covariance reconstruction enables practical hybrid SMI
Hybrid SMI Realization via Matrix Completion and Riemannian Manifold Optimization on Narrowband Sub-Array Based Architectures

Tarun Suman Cousik +5
cs.SD 2026-04-20 reviewed

Rule-based alignment cuts rule violations in lyric-to-melody generation
Aligning Language Models for Lyric-to-Melody Generation with Rule-Based Musical Constraints

Hao Meng +4
eess.AS 2026-04-20 reviewed

Kernel plasticity lifts Hebbian audio learning to 76.3% accuracy
Incremental learning for audio classification with Hebbian Deep Neural Networks

Riccardo Casciotti +3
eess.AS 2026-04-20 reviewed

2.3B LLM-based ASR outperforms larger models
NIM4-ASR: Towards Efficient, Robust, and Customizable Real-Time LLM-Based ASR

Yuan Xie +11
eess.AS 2026-04-20 reviewed

Benchmark shows TTS systems lag on complex instructions
MINT-Bench: A Comprehensive Multilingual Benchmark for Instruction-Following Text-to-Speech

Huakang Chen +14
eess.AS 2026-04-19 reviewed

Non-verbal cues supervise speech emotion recognition across languages
Prosody as Supervision: Bridging the Non-Verbal--Verbal for Multilingual Speech Emotion Recognition

Girish +2
eess.AS 2026-04-19 reviewed

Hyperbolic model detects codec deepfakes in diseased voices
HCFD: A Benchmark for Audio Deepfake Detection in Healthcare

Mohd Mujtaba Akhtar +2
cs.CL 2026-04-19 reviewed

Translation system keeps laughter and tears in speech
MoVE: Translating Laughter and Tears via Mixture of Vocalization Experts in Speech-to-Speech Translation

Szu-Chi Chen +4
eess.AS 2026-04-19 reviewed

Audio models show bigger bias from gender than from accents
VIBE: Voice-Induced open-ended Bias Evaluation for Large Audio-Language Models via Real-World Speech

Yi-Cheng Lin +3
eess.AS 2026-04-18 reviewed

Anonymized speech trains AI models nearly as well as raw recordings
Anonymization, Not Elimination: Utility-Preserved Speech Anonymization

Yunchong Xiao +6
eess.AS 2026-04-18 reviewed

Room acoustics recast as state-space model of boundary integral equation
A state-space representation of the boundary integral equation for room acoustic modelling

Randall Ali +4
cs.SD 2026-04-17 reviewed

Pairwise audio comparisons lift deepfake detection up to 2x on wild data
ICLAD: In-Context Learning with Comparison-Guidance for Audio Deepfake Detection

Benjamin Chou +2
eess.AS 2026-04-17 reviewed

Neural-only detection risks falling short against future fake speech
Neural Encoding Detection is Not All You Need for Synthetic Speech Detection

Luca Cuccovillo +3
cs.SD 2026-04-17 reviewed

Compact network spots AI music via codec artifacts
ArtifactNet: Detecting AI-Generated Music via Forensic Residual Physics

Heewon Oh
cs.SD 2026-04-17 reviewed

Benchmark reveals speech AI limits on complex tool calls
Audio2Tool: Speak, Call, Act -- A Dataset for Benchmarking Speech Tool Use

Ramit Pahwa +6
cs.CL 2026-04-17 reviewed

Qwen3.5-Omni claims SOTA on 215 audio-visual tasks
Qwen3.5-Omni Technical Report

Qwen Team
cs.SD 2026-04-16 reviewed

Manual protocol measures bar-level tempo in historical chamber music
A Manual Bar-by-Bar Tempo Measurement Protocol for Polyphonic Chamber Music Recordings: Design, Validation, and Application to Beethoven's Piano and Cello Sonatas

Ignasi Sole
cs.SD 2026-04-16 reviewed

LSTM with MFCC features reaches 99% accuracy on speech emotions
Speech Emotion Recognition Using MFCC Features and LSTM-Based Deep Learning Model

Adelekun Oluwademilade +9
cs.SD 2026-04-16 reviewed

RL fine-tuning cuts speech WER to 3.2% at 200bps
ClariCodec: Optimising Neural Speech Codes for 200bps Communication using Reinforcement Learning

Junyi Wang +6
cs.SD 2026-04-16 reviewed

Acoustic features cut recall from 66% to 47% in volatility forecasts
The Acoustic Camouflage Phenomenon: Re-evaluating Speech Features for Financial Risk Prediction

Dhruvin Dungrani +1
eess.AS 2026-04-16 reviewed

SongBench rates AI songs on seven expert dimensions
SongBench: A Fine-Grained Multi-Aspect Benchmark for Song Quality Assessment

Dapeng Wu +7
eess.AS 2026-04-16 reviewed

UniPASE tops challenge by restoring clean phonetic content first
UniPASE: A Generative Model for Universal Speech Enhancement with High Fidelity and Low Hallucinations

Xiaobin Rong +4
cs.SD 2026-04-16 reviewed

SLMs spot norms in text yet ignore them when spoken
VoxSafeBench: Not Just What Is Said, but Who, How, and Where

Yuxiang Wang +11
eess.AS 2026-04-15 reviewed

Speaker overlap boosts speech depression detection accuracy
Who is Speaking or Who is Depressed? A Controlled Study of Speaker Leakage in Speech-Based Depression Detection

Hsiang-Chen Yeh +5
eess.AS 2026-04-15 reviewed

Speaker ID errors drop 93% with enhanced open-set tuning
SpeakerRPL v2: Robust Open-set Speaker Identification through Enhanced Few-shot Foundation Tuning and Model Fusion

Zhiyong Chen +4
eess.AS 2026-04-15 reviewed

LLM meta-evaluator beats speech quality predictors with few labels
Few-Shot and Pseudo-Label Guided Speech Quality Evaluation with Large Language Models

Ryandhimas E. Zezario +5
eess.AS 2026-04-15 reviewed

RBF SVM detects deepfake audio at 93% accuracy
Classical Machine Learning Baselines for Deepfake Audio Detection on the Fake-or-Real Dataset

Faheem Ahmad +2
eess.AS 2026-04-14 reviewed

Adapted speech LLMs predict word timestamps and lift ASR accuracy
In-Sync: Adaptation of Speech Aware Large Language Models for ASR with Word Level Timestamp Predictions

Xulin Fan +5
eess.AS 2026-04-14 reviewed

Prosody pretraining halves error on emotional speech deepfakes
ProSDD: Learning Prosodic Representations for Speech Deepfake Detection against Expressive and Emotional Attacks

Aurosweta Mahapatra +4
cs.CL 2026-04-14 reviewed

Async retrieval gives full-duplex speech models non-duplex factuality
MoshiRAG: Asynchronous Knowledge Retrieval for Full-Duplex Speech Language Models

Chung-Ming Chien +5
eess.AS 2026-04-14 reviewed

Waveguides make real-time physical sound modeling practical
Four Decades of Digital Waveguides

Pablo Tablas de Paula +3
eess.AS 2026-04-14 reviewed

Audio model gains step-by-step reasoning from 545k curated samples
Audio-Cogito: Towards Deep Audio Reasoning in Large Audio Language Models

Longhao Li +7
eess.AS 2026-04-14 reviewed

One-step codec latent conversion enables streaming zero-shot VC
X-VC: Zero-shot Streaming Voice Conversion in Codec Space

Qixi Zheng +9
eess.AS 2026-04-14 reviewed

Circular mic array lets UAVs detect victims by sound
Sky-Ear: An Unmanned Aerial Vehicle-Enabled Victim Sound Detection and Localization System

Yi Hong +4
eess.AS 2026-04-14 reviewed

Delayed secondary speaker corrects both timbre and space
Room compensation for loudspeaker reproduction using a supporting source

James Brooks-Park +3