archive

Every paper Pith has read.

623 papers in eess.AS · page 8

eess.AS 2025-10-07 reviewed

Discrete tokens close the ASR-TTS loop and cut error rates
TokenChain: A Discrete Speech Chain via Semantic Token Modeling

Mingxuan Wang +1
eess.AS 2025-10-06 reviewed

Pruned Whisper keeps 90 percent accuracy at 48 percent smaller size
BaldWhisper: Faster Whisper with Head Shearing and Layer Merging

Yaya Sy +2
eess.AS 2025-10-03 reviewed

Source embedding scales music structure analysis to messy labels
SongFormer: Scaling Music Structure Analysis with Heterogeneous Supervision

Chunbo Hao +7
eess.AS 2025-10-02 reviewed

Matrix method localizes multiple sound sources with one variable per source
Multi-Source Position and Direction-of-Arrival Estimation Based on Euclidean Distance Matrices

Klaus Brümann +1
cs.SD 2025-10-02 reviewed

System changes live audio effects from performer emotions
Go witheFlow: Real-time Emotion Driven Audio Effects Modulation

Edmund Dervakos +4
cs.MM 2025-09-30 reviewed

Twin DiT modules generate synced audio and video in one pass
Ovi: Twin Backbone Cross-Modal Fusion for Audio-Video Generation

Chetwin Low +2
eess.AS 2025-09-30 reviewed

Spoken AI models falter on timing and sync tasks
Game-Time: Evaluating Temporal Dynamics in Spoken Language Models

Kai-Wei Chang +9
eess.AS 2025-09-29 reviewed

Semantic tokens guide flow matching to enhance speech faithfully
SenSE: Semantic-Aware High-Fidelity Universal Speech Enhancement

Xingchen Li +6
eess.AS 2025-09-29 reviewed

Zero-shot cosine outperforms few-shot in OOD deepfake tracing
Advancing Zero-Shot Open-Set Speech Deepfake Source Tracing

Manasi Chhibber +2
cs.SD 2025-09-27 reviewed

TV dialogue dataset raises voice role-play scores 38 percent
AudioRole: An Audio Dataset for Character Role-Playing in Large Language Models

Wenyu Li +4
cs.SD 2025-09-26 reviewed

VLM generates music from images with no training or fine-tuning
Zero-Effort Image-to-Music Generation: An Interpretable RAG-based VLM Approach

Zijian Zhao +2
eess.AS 2025-09-26 reviewed

Models bias continuations toward modal voice more for female prompts
Speak Your Mind: The Speech Continuation Task as a Probe of Voice-Based Model Bias

Shree Harsha Bokkahalli Satish +4
eess.AS 2025-09-25 reviewed

Adversarial noise tricks speech enhancers into new meanings
Are Modern Speech Enhancement Systems Vulnerable to Adversarial Attacks?

Rostislav Makarov +2
eess.AS 2025-09-23 reviewed

Source count fusion lifts DOA F1-scores by 14% in hearing aids
Multi-Speaker DOA Estimation in Binaural Hearing Aids using Deep Learning and Speaker Count Fusion

Farnaz Jazaeri +3
cs.SD 2025-09-22 reviewed

Model turns video into object-aware stereo sound
StereoFoley: Object-Aware Stereo Audio Generation from Video

Tornike Karchkhadze +6
cs.CL 2025-09-22 reviewed

One model hits SOTA on text, image, audio and video
Qwen3-Omni Technical Report

Jin Xu +37
eess.AS 2025-09-22 reviewed

Comparator loss yields speech severity scores for disease tracking
Comparator Loss: An Ordinal Contrastive Loss to Derive a Severity Score for Speech-based Health Monitoring

Jacob J Webber +7
eess.AS 2025-09-21 reviewed

Single model separates and locates multiple overlapping sounds
DeepASA: An Object-Oriented Multi-Purpose Network for Auditory Scene Analysis

Dongheon Lee +2
eess.AS 2025-09-19 reviewed

Visual cues first cluster speech features in AVSR models
Interpreting the Role of Visemes in Audio-Visual Speech Recognition

Aristeidis Papadopoulos +1
cs.SD 2025-09-19 reviewed

Differentiable method boosts acoustic modeling from limited data
Differentiable Acoustic Radiance Transfer

Sungho Lee +5
cs.SD 2025-09-19 reviewed

Speaker-aware model yields more realistic conversation timings
From Independence to Interaction: Speaker-Aware Simulation of Multi-Speaker Conversational Timing

Máté Gedeon +1
cs.SD 2025-09-19 reviewed

1% data addition activates simultaneous translation in audio models
Direct Simultaneous Translation Activation for Large Audio-Language Models

Pei Zhang +6
cs.SD 2025-09-18 reviewed

Thai encoder and data pipeline enable multitask speech AI
Towards Building Speech Large Language Models for Multitask Understanding in Low-Resource Languages

Mingchen Shao +6
eess.AS 2025-09-17 reviewed

TTS systems default to adult voices despite child or elderly instructions
Do You Hear What I Mean? Quantifying the Instruction-Perception Gap in Instruction-Guided Expressive Text-To-Speech Systems

Yi-Cheng Lin +4
cs.SD 2025-09-16 reviewed

Mixture of experts blends binaural filters for moving talkers
Mixture-of-Experts Framework for Field-of-View Enhanced Signal-Dependent Binauralization of Moving Talkers

Manan Mittal +6
eess.AS 2025-09-10 reviewed

Expert routing boosts noisy emotion recognition 12 percent
Joint Learning using Mixture-of-Expert-Based Representation for Speech Enhancement and Robust Emotion Recognition

Jing-Tong Tzeng +2
cs.SD 2025-09-09 reviewed

Audio LLM toolkit delivers up to 151 percent faster evaluations
AU-Harness: An Open-Source Toolkit for Holistic Evaluation of Audio LLMs

Hoang Nguyen +11
cs.SD 2025-09-07 reviewed

Diffusion model adds personalized events from few reference audios
DreamAudio: Customized Text-to-Audio Generation with Diffusion Models

Yi Yuan +7
eess.AS 2025-09-04 reviewed

Audiobook quotes with verb labels boost TTS expressivity
Computational Narrative Understanding for Expressive Text-to-Speech

Gaspard Michel +2
eess.AS 2025-08-28 reviewed

Shared encoder beats separate models on diarization and separation
Unifying Diarization, Separation, and ASR with Multi-Speaker Encoder

Muhammad Shakeel +4
cs.CL 2025-08-25 reviewed

Self-generated data aligns speech LLMs to instructions better
Enhancing Speech Large Language Models through Reinforced Behavior Alignment

Yansong Liu +2
cs.CL 2025-08-24 reviewed

Adapted open LLMs match commercial dementia screening tools
Speech-Based Cognitive Screening: A Systematic Evaluation of LLM Adaptation Strategies

Fatemeh Taherinezhad +8
eess.AS 2025-08-20 reviewed

Physics-aware kernels recover steering vectors from 10x fewer measurements
Gaussian Process Regression of Steering Vectors With Physics-Aware Deep Composite Kernels for Augmented Listening

Diego Di Carlo (RIKEN AIP) +6
cs.CL 2025-08-17 reviewed

Fine-tuned causal Whisper outperforms streaming baselines under 300ms
WhisperRT -- Turning Whisper into a Causal Streaming Model

Tomer Krichli +2
eess.AS 2025-08-10 reviewed

ASR fixes grouped into five classes with shared metrics
Non-Intrusive Automatic Speech Recognition Refinement: A Survey

Mohammad Reza Peyghan +5
cs.SD 2025-08-05 reviewed

One model restores and masters music from text instructions
SonicMaster: Towards Controllable All-in-One Music Restoration and Mastering

Jan Melechovsky +3
eess.AS 2025-07-31 reviewed

Benchmark with multi-expert captions tests detailed audio understanding
MECAT: A Multi-Experts Constructed Benchmark for Fine-Grained Audio Understanding Tasks

Yadong Niu +9
eess.AS 2025-07-30 reviewed

New benchmark finds two strategies in full-duplex speech models
Full-Duplex-Bench v1.5: Evaluating Overlap Handling for Full-Duplex Speech Models

Guan-Ting Lin +6
cs.CL 2025-07-22 reviewed

Step-Audio 2 sets SOTA on audio understanding and conversation
Step-Audio 2 Technical Report

Boyong Wu +108
cs.SD 2025-07-21 reviewed

Chaos discriminators set new SOTA for speech bandwidth extension
CIS-BWE: Chaos-Informed Speech Bandwidth Extension

Tarikul Islam Tamiti +3
cs.CL 2025-07-17 reviewed

Pipeline adds stress and punctuation to Russian speech data
Balalaika: Data-Centric, Prosody-Aware Annotation Pipeline for Russian Speech

Kirill Borodin +5
eess.AS 2025-07-16 reviewed

Text transcriptions boost attacks on voice anonymization
VoxATtack: A Multimodal Attack on Voice Anonymization Systems

Ahmad Aloradi +3
cs.SD 2025-07-16 reviewed

AI model builds 3D ocean sound maps from surface data alone
A Multimodal Data Fusion Attention-Empowered Generative Adversarial Network for Real Time 3D Underwater Sound Speed Field Construction

Wei Huang +6
eess.AS 2025-07-12 reviewed

Flow-matching model generates spoken dialogues faster
ZipVoice-Dialog: Non-Autoregressive Spoken Dialogue Generation with Flow Matching

Han Zhu +13
cs.SD 2025-07-10 reviewed

Open audio model tops 20+ benchmarks with public data only
Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio Language Models

Arushi Goel +10
cs.SD 2025-07-09 reviewed

Coupled quadratic program cuts distortion in multichannel mixers
Constraint Optimized Multichannel Mixer-limiter Design

Yuancheng Luo +2
eess.AS 2025-07-08 reviewed

One-step model beats 30-step teacher in speech enhancement
Robust One-step Speech Enhancement via Consistency Distillation

Liang Xu +2
cs.CV 2025-06-30 reviewed

Unified model generates speech and facial motion together
JAM-Flow: Joint Audio-Motion Synthesis with Flow Matching

Mingi Kwon +4
cs.SD 2025-06-17 reviewed

Scattered sound classifies hair type and moisture at nearly 90 percent
Acoustic scattering AI for non-invasive object classifications: A case study on hair assessment

Long-Vu Hoang +2
cs.SD 2025-06-16 reviewed

Calibrated distillation lifts low-complexity speech enhancement
Leveraging Local and Global Knowledge Integration with Time-Frequency Calibrated Distillation for Speech Enhancement

Jiaming Cheng +8