archive
Every paper Pith has read.
623 papers in eess.AS · page 4
-
One dilated CNN plus resampling matches AR denoising for periodic signals
Dilated CNNs for Periodic Signal Processing: A Low-Complexity Approach
-
Tutorial splits top open-source speaker diarization into seven stages
DiariZen Explained: A Tutorial for the Open Source State-of-the-Art Speaker Diarization Pipeline
-
New benchmark tests AI on handling speech overlaps and interruptions
Full-Duplex Interaction in Spoken Dialogue Systems: A Comprehensive Study from the ICASSP 2026 HumDial Challenge
-
Benchmark reveals AI music models perceive notation but miss theory
ONOTE: Benchmarking Omnimodal Notation Processing for Expert-level Music Intelligence
-
MERT metrics better match human ratings for music source separation
Embedding-Based Intrusive Evaluation Metrics for Musical Source Separation Using MERT Representations
-
Decorrelation reduces brain-to-text WER from 26.3% to 21.6%
MoDAl: Self-Supervised Neural Modality Discovery via Decorrelation for Speech Neuroprosthesis
-
Diatonic seventh chords form a Fano configuration
Tonnetz Theory, Classical Harmony, and the Combinatorial Geometry of Abstract Musical Resources
-
Hyperbolic fusion spots Indic codec deepfakes
Indic-CodecFake meets SATYAM: Towards Detecting Neural Audio Codec Synthesized Speech Deepfakes in Indic Languages
-
Cascaded temporal stages yield natural TTS with fewer parameters
Text-To-Speech with Chain-of-Details: modeling temporal dynamics in speech generation
-
Voice range tracks TTS capability while CPPs separate natural from robotic speech
Voice Mapping of Text-to-Speech Systems: A Metric-Based Approach for Voice Quality Assessment
-
One LLM replaces VAD, ASR and interruption detection for live speech
UAF: A Unified Audio Front-end LLM for Full-Duplex Speech Interaction
-
Unscripted phone calls form new benchmark for Indian speech recognition
Voice of India: A Large-Scale Benchmark for Real-World Speech Recognition in India
-
Consistency regularization unifies offline and streaming RNNT ASR
Reducing the Offline-Streaming Gap for Unified ASR Transducer with Consistency Regularization
-
Photoelectric servo cuts mic self-noise to 11 dBA
Self-Noise Reduction for Capacitive Sensors via Photoelectric DC Servo: Application to Condenser Microphones
-
Covariance reconstruction enables practical hybrid SMI
Hybrid SMI Realization via Matrix Completion and Riemannian Manifold Optimization on Narrowband Sub-Array Based Architectures
-
Rule-based alignment cuts rule violations in lyric-to-melody generation
Aligning Language Models for Lyric-to-Melody Generation with Rule-Based Musical Constraints
-
Kernel plasticity lifts Hebbian audio learning to 76.3% accuracy
Incremental learning for audio classification with Hebbian Deep Neural Networks
-
2.3B LLM-based ASR outperforms larger models
NIM4-ASR: Towards Efficient, Robust, and Customizable Real-Time LLM-Based ASR
-
Benchmark shows TTS systems lag on complex instructions
MINT-Bench: A Comprehensive Multilingual Benchmark for Instruction-Following Text-to-Speech
-
Non-verbal cues supervise speech emotion recognition across languages
Prosody as Supervision: Bridging the Non-Verbal--Verbal for Multilingual Speech Emotion Recognition
-
Hyperbolic model detects codec deepfakes in diseased voices
HCFD: A Benchmark for Audio Deepfake Detection in Healthcare
-
Translation system keeps laughter and tears in speech
MoVE: Translating Laughter and Tears via Mixture of Vocalization Experts in Speech-to-Speech Translation
-
Audio models show bigger bias from gender than from accents
VIBE: Voice-Induced open-ended Bias Evaluation for Large Audio-Language Models via Real-World Speech
-
Anonymized speech trains AI models nearly as well as raw recordings
Anonymization, Not Elimination: Utility-Preserved Speech Anonymization
-
Room acoustics recast as state-space model of boundary integral equation
A state-space representation of the boundary integral equation for room acoustic modelling
-
Pairwise audio comparisons lift deepfake detection up to 2x on wild data
ICLAD: In-Context Learning with Comparison-Guidance for Audio Deepfake Detection
-
Neural-only detection risks falling short against future fake speech
Neural Encoding Detection is Not All You Need for Synthetic Speech Detection
-
Compact network spots AI music via codec artifacts
ArtifactNet: Detecting AI-Generated Music via Forensic Residual Physics
-
Benchmark reveals speech AI limits on complex tool calls
Audio2Tool: Speak, Call, Act -- A Dataset for Benchmarking Speech Tool Use
-
Qwen3.5-Omni claims SOTA on 215 audio-visual tasks
Qwen3.5-Omni Technical Report
-
Manual protocol measures bar-level tempo in historical chamber music
A Manual Bar-by-Bar Tempo Measurement Protocol for Polyphonic Chamber Music Recordings: Design, Validation, and Application to Beethoven's Piano and Cello Sonatas
-
LSTM with MFCC features reaches 99% accuracy on speech emotions
Speech Emotion Recognition Using MFCC Features and LSTM-Based Deep Learning Model
-
RL fine-tuning cuts speech WER to 3.2% at 200 bps
ClariCodec: Optimising Neural Speech Codes for 200bps Communication using Reinforcement Learning
-
Acoustic features cut recall from 66% to 47% in volatility forecasts
The Acoustic Camouflage Phenomenon: Re-evaluating Speech Features for Financial Risk Prediction
-
SongBench rates AI songs on seven expert dimensions
SongBench: A Fine-Grained Multi-Aspect Benchmark for Song Quality Assessment
-
UniPASE tops challenge by restoring clean phonetic content first
UniPASE: A Generative Model for Universal Speech Enhancement with High Fidelity and Low Hallucinations
-
SLMs spot norms in text yet ignore them when spoken
VoxSafeBench: Not Just What Is Said, but Who, How, and Where
-
Speaker overlap boosts speech depression detection accuracy
Who is Speaking or Who is Depressed? A Controlled Study of Speaker Leakage in Speech-Based Depression Detection
-
Speaker ID errors drop 93% with enhanced open-set tuning
SpeakerRPL v2: Robust Open-set Speaker Identification through Enhanced Few-shot Foundation Tuning and Model Fusion
-
LLM meta-evaluator beats speech quality predictors with few labels
Few-Shot and Pseudo-Label Guided Speech Quality Evaluation with Large Language Models
-
RBF SVM detects deepfake audio at 93% accuracy
Classical Machine Learning Baselines for Deepfake Audio Detection on the Fake-or-Real Dataset
-
Adapted speech LLMs predict word timestamps and lift ASR accuracy
In-Sync: Adaptation of Speech Aware Large Language Models for ASR with Word Level Timestamp Predictions
-
Prosody pretraining halves error on emotional speech deepfakes
ProSDD: Learning Prosodic Representations for Speech Deepfake Detection against Expressive and Emotional Attacks
-
Async retrieval gives full-duplex speech models non-duplex factuality
MoshiRAG: Asynchronous Knowledge Retrieval for Full-Duplex Speech Language Models
-
Waveguides make real-time physical sound modeling practical
Four Decades of Digital Waveguides
-
Audio model gains step-by-step reasoning from 545k curated samples
Audio-Cogito: Towards Deep Audio Reasoning in Large Audio Language Models
-
One-step codec latent conversion enables streaming zero-shot VC
X-VC: Zero-shot Streaming Voice Conversion in Codec Space
-
Circular mic array lets UAVs detect victims by sound
Sky-Ear: An Unmanned Aerial Vehicle-Enabled Victim Sound Detection and Localization System
-
Delayed secondary speaker corrects both timbre and space
Room compensation for loudspeaker reproduction using a supporting source