archive
Every paper Pith has read. Search by title, abstract, or pith.
623 papers in eess.AS · page 6
-
Dual-branch graphs disentangle features for emotion recognition
Disentangled Dual-Branch Graph Learning for Conversational Emotion Recognition
-
FastTurn detects turns faster by mixing early semantics with sound
FastTurn: Unifying Acoustic and Streaming Semantic Cues for Low-Latency and Robust Turn Detection
-
RAVN is a navigation system for robots that uses audio signals to estimate how reliable…
Reliability-Aware Geometric Fusion for Robust Audio-Visual Navigation
-
Spatial descriptors cut steps in audio-visual navigation
Spatial-Aware Conditioned Fusion for Audio-Visual Navigation
-
Spatial fusion lifts audio-visual navigation on unheard sounds
Audio Spatially-Guided Fusion for Audio-Visual Navigation
-
PhiNet matches black-box speaker verification with phonetic explanations
PhiNet: Speaker Verification with Phonetic Interpretability
-
Speech depression detector generalizes across languages and matches EEG markers
Validating Computational Markers of Depressive Behavior: Cross-Linguistic Speech-Based Depression Detection with Neurophysiological Validation
-
Diffusion U-Net matches vocal separation baselines
Diff-VS: Efficient Audio-Aware Diffusion U-Net for Vocals Separation
-
Zero-shot TTS reaches 600 languages with direct text-to-acoustic mapping
OmniVoice: Towards Omnilingual Zero-Shot Text-to-Speech with Diffusion Language Models
-
Small models master Arabic speech through compressed distillation
HARNESS: Lightweight Distilled Arabic Speech Foundation Models
-
Asymmetric decoder refines speech separation with TF correlations
Asymmetric Encoder-Decoder Based on Time-Frequency Correlation for Speech Separation
-
0.28 F1 jump in Arabic mispronunciation detection
IQRA 2026: Interspeech Challenge on Automatic Pronunciation Assessment for Modern Standard Arabic (MSA)
-
Hierarchical model predicts human ratings of AI-dubbed video at PCC > 0.75
Can Hierarchical Cross-Modal Fusion Predict Human Perception of AI Dubbed Content?
-
KoALa-Bench tests LALMs on Korean speech understanding and faithfulness
KoALa-Bench: Evaluating Large Audio Language Models on Korean Speech Understanding and Faithfulness
-
Fairness model quantifies each demographic's contribution to SER bias
Explainable Speech Emotion Recognition: Weighted Attribute Fairness to Model Demographic Contributions to Social Bias
-
Diffusion model changes song lyrics while preserving melody
YingMusic-Singer-Plus: Controllable Singing Voice Synthesis with Flexible Lyric Manipulation and Annotation-free Melody Guidance
-
Lightning V2 achieves 4x lower TTS cost on Tenstorrent vs L40S
Rewriting TTS Inference Economics: Lightning V2 on Tenstorrent Achieves 4x Lower Cost Than NVIDIA L40S
-
Continuous models needed to cut uncertainty in emotion AI
Modelling Emotions is an Elusive Pursuit in Affective Computing
-
TiCo cuts spoken response duration error by 2.7 times
TiCo: Time-Controllable Spoken Dialogue Model
-
Hierarchical labels turn text into a wide-band control channel for long speech synthesis
Borderless Long Speech Synthesis
-
Dialogue models reason internally while listening
The Silent Thought: Modeling Internal Cognition in Full-Duplex Spoken Dialogue Models via Latent Reasoning
-
Dialogue models gain silent thinking via recursive latent updates
The Silent Thought: Modeling Internal Cognition in Full-Duplex Spoken Dialogue Models via Latent Reasoning
-
Neural models score TTS quality better than human raters
Neural networks for Text-to-Speech evaluation
-
AI model mixes live music with zero latency
AILive Mixer: A Deep Learning based Zero Latency Automatic Music Mixer for Live Music Performances
-
Pseudo-labels and contrastive pretraining reach 0.761 SRCC on unseen dysarthric speech
Something from Nothing: Data Augmentation for Robust Severity Level Estimation of Dysarthric Speech
-
Tight integration beats shallow fusion for LLMs in speech recognition
LLMs and Speech: Integration vs. Combination
-
Reward model judges spoken dialogues on prosody and natural phrasing
SDiaReward: Modeling and Benchmarking Spoken Dialogue Rewards with Modality and Colloquialness
-
Harf-Speech matches expert Arabic speech scores at 0.79 correlation
Harf-Speech: A Clinically Aligned Framework for Arabic Phoneme-Level Speech Assessment
-
Non-iterative dMWF matches centralized Wiener filter
Distributed Multichannel Wiener Filtering for Wireless Acoustic Sensor Networks
-
Text-to-audio model generates room impulse responses
Adapting a Text-to-Audio Model for Room Impulse Response Generation
-
Spoof detectors guide hierarchical decoding for cleaner speech synthesis
Hierarchical Decoding for Discrete Speech Synthesis with Multi-Resolution Spoof Detection
-
SSL speech models put pitch and gender in first principal dimension
Interpreting Speaker Characteristics in the Dimensions of Self-Supervised Speech Features
-
Spoof detectors vary sharply across 66 languages
When Spoof Detectors Travel: Evaluation Across 66 Languages in the Low-Resource Language Spoofing Corpus
-
Cross-ASR disagreement flags risky medical transcript segments
From Black Box to Glass Box: Cross-Model ASR Disagreement to Prioritize Review in Ambient AI Scribe Documentation
-
Large speech models outperform others at detecting audio deepfakes
A SUPERB-Style Benchmark of Self-Supervised Speech Models for Audio Deepfake Detection
-
LoRa enables secure 1.5 km peer-to-peer voice links
Modeling and Link Budget Feasibility Analysis of Secure LoRa-Based Peer-to-Peer Communication for Short-Range Tactical Networks
-
LMU and entropy fusion lift infant cry classification across domains
LMU-Based Sequential Learning and Posterior Ensemble Fusion for Cross-Domain Infant Cry Classification
-
MIDI plus structure labels keep long songs coherent
MIDI-Informed Singing Accompaniment Generation in a Compositional Song Pipeline
-
Speech models perform phonological vector arithmetic
[b]=[d]-[t]+[p]: Self-supervised Speech Models Discover Phonological Vector Arithmetic
-
Single mic estimates sound speed during playback
Online Single-Channel Audio-Based Sound Speed Estimation for Robust Multi-Channel Audio Control
-
Acoustic maps from beamforming detect voice replays
Multi-Channel Replay Speech Detection using Acoustic Maps
-
Machine identity knowledge conceals ASD weaknesses
How Much Does Machine Identity Matter in Anomalous Sound Detection at Test Time?
-
LLM passes cut diarization errors in French clinical speech
Iterative LLM-based improvement for French Clinical Interview Transcription and Speaker Diarization
-
Noise augmentation tops data strategies for Parkinson's speech enhancement
Data Augmentation for Pathological Speech Enhancement
-
Branch analysis uncovers flawed specialization in anti-spoofing
Interpreting Multi-Branch Anti-Spoofing Architectures: Correlating Internal Strategy with Empirical Performance
-
Speech enhancement pruning masks predict VAD and pitch at 93 percent accuracy
From Diet to Free Lunch: Estimating Auxiliary Signal Properties using Dynamic Pruning Masks in Speech Enhancement Networks
-
EEG-to-text model stops hallucinating by grounding every token in brain signals
Escaping the BLEU Trap: A Signal-Grounded Framework with Decoupled Semantic Guidance for EEG-to-Text Decoding
-
Wavelet scattering features boost speech deepfake detection
WST-X Series: Wavelet Scattering Transform for Interpretable Speech Deepfake Detection
-
Transformer reconstructs room impulse responses from sparse mics
RIR-Former: Coordinate-Guided Transformer for Continuous Reconstruction of Room Impulse Responses
-
Audio foundation models integrate core tasks into signal processing classes
Generative AI in Signal Processing Education: An Audio Foundation Model Based Approach