archive

Every paper Pith has read.

623 papers in eess.AS · page 5

eess.AS 2026-04-14 reviewed

Speech synthesis hits 49 ms first-byte latency via block-wise decoding
An Ultra-Low Latency, End-to-End Streaming Speech Synthesis Architecture via Block-Wise Generation and Depth-Wise Codec Decoding

Tianhui Su +4
eess.AS 2026-04-14 reviewed

Common word cues cut rare bias word errors by 16% in speech LLMs
Contextual Biasing for ASR in Speech LLM with Common Word Cues and Bias Word Position Prediction

Sashi Novitasari +3
eess.AS 2026-04-14 reviewed

VoxEffects dataset supplies exact effect chains for speech audio
VoxEffects: A Speech-Oriented Audio Effects Dataset and Benchmark

Zhe Zhang +2
eess.AS 2026-04-14 reviewed

Mamba predicts clean tokens to boost CI speech in noise
TokenSE: a Mamba-based discrete token speech enhancement framework for cochlear implants

Hsin-Tien Chiang +1
eess.AS 2026-04-13 reviewed

Pre-quantization fusion adds video to audio tokens without reconstruction loss
Why Your Tokenizer Fails in Information Fusion: A Timing-Aware Pre-Quantization Fusion for Video-Enhanced Audio Tokenization

Xiangyu Zhang +5
eess.AS 2026-04-13 reviewed

Watermark survives normal edits but breaks on deepfakes
StreamMark: A Deep Learning-Based Semi-Fragile Audio Watermarking for Proactive Deepfake Detection

Zhentao Liu +1
eess.AS 2026-04-13 reviewed

Audio AI models lose track of emotions in long talks
HumDial-EIBench: A Human-Recorded Multi-Turn Emotional Intelligence Benchmark for Audio Language Models

Shuiyuan Wang +7
eess.AS 2026-04-13 reviewed

LLM with cluster tags beats sequential diarization plus ASR
Speaker Attributed Automatic Speech Recognition Using Speech Aware LLMS

Hagai Aronowitz +4
eess.AS 2026-04-13 reviewed

Joint teacher-student updates cut speech WER by 4.6%
Teaching the Teachers: Boosting unsupervised domain adaptation in speech recognition by ensemble update

Rehan Ahmad +3
eess.AS 2026-04-13 reviewed

Neural estimator preserves direction in multichannel speech enhancement
Direction-Preserving MIMO Speech Enhancement Using a Neural Covariance Estimator

Thomas Deppisch
eess.SP 2026-04-13 reviewed

Deep learning ANC preserves speech while cutting non-stationary noise
Speech-preserving active noise control: a deep learning approach in reverberant environments

Shuning Dai
cs.SD 2026-04-13 reviewed

AF-Next outperforms similar open audio models on 20 benchmarks
Audio Flamingo Next: Next-Generation Open Audio-Language Models for Speech, Sound, and Music

Sreyan Ghosh +17
cs.SD 2026-04-12 reviewed

Synthetic labels keep music-flavor structure intact
Multimodal Dataset Normalization and Perceptual Validation for Music-Taste Correspondences

Matteo Spanio +2
cs.CL 2026-04-11 reviewed

Binary projection halves repetition in full-duplex speech models
ASPIRin: Action Space Projection for Interactivity-Optimized Reinforcement Learning in Full-Duplex Speech Language Models

Chi-Yuan Hsiao +5
cs.LG 2026-04-10 reviewed

Time-aware networks fix read bias in live speech translation
Regularized Entropy Information Adaptation with Temporal-Awareness Networks for Simultaneous Speech Translation

Joseph Liu +3
eess.AS 2026-04-10 reviewed

Self-control speech tasks sense student emotions
Toward using Speech to Sense Student Emotion in Remote Learning Environments

Sargam Vyas +5
eess.AS 2026-04-10 reviewed

Utterance filters pick reliable child ASR outputs at 97% precision
Utterance-Level Methods for Identifying Reliable ASR-Output for Child Speech

Gus Lathouwers +3
eess.AS 2026-04-10 reviewed

Diverse broadcast audio pretraining boosts SSL models
Data Selection Effects on Self-Supervised Learning of Audio Representations for French Audiovisual Broadcasts

Valentin Pelloin +3
eess.AS 2026-04-10 reviewed

Language model separates music stems via discrete tokens
Discrete Token Modeling for Multi-Stem Music Source Separation with Language Models

Pengbo Lyu +6
cs.SD 2026-04-10 reviewed

Model turns mixed dialogue audio into separate speaker tracks
DialogueSidon: Recovering Full-Duplex Dialogue Tracks from In-the-Wild Dialogue Audio

Wataru Nakata +4
eess.AS 2026-04-10 reviewed

Phoneme sequences outperform projectors in low-resource LLM ASR
Phonemes vs. Projectors: An Investigation of Speech-Language Interfaces for LLM-based ASR

Ziwei Li +4
eess.AS 2026-04-10 reviewed

Confidence weighting cuts medical ASR errors for Telugu and Kannada
Enhancing ASR Performance in the Medical Domain for Dravidian Languages

Sri Charan Devarakonda +5
eess.AS 2026-04-10 reviewed

Phonetic sync aligns dubbed audio to original lips
PS-TTS: Phonetic Synchronization in Text-to-Speech for Achieving Natural Automated Dubbing

Changi Hong +6
cs.SD 2026-04-09 reviewed

ASR models output wrong scripts in 21% of multilingual cases
Script collapse in multilingual ASR: A reference-free metric and 100-pair benchmark

Hanif Rahman
eess.AS 2026-04-09 reviewed

Audio prompts plus online RL lift conversational TTS quality
Enhancing Conversational TTS with Cascaded Prompting and ICL-Based Online Reinforcement Learning

Zhicheng Ouyang +6
cs.SD 2026-04-09 reviewed

Front-end choice dominates deepfake audio detector performance
DeepFense: A Unified, Modular, and Extensible Framework for Robust Deepfake Audio Detection

Yassine El Kheir +8
eess.AS 2026-04-09 reviewed

Ring mixing halves residual noise in unsupervised speech separation
Ring Mixing with Auxiliary Signal-to-Consistency-Error Ratio Loss for Unsupervised Denoising in Speech Separation

Matthew Maciejewski +1
cs.SD 2026-04-09 reviewed

Interaction history lifts device speech detection F1 to 0.95
Selective Attention System (SAS): Device-Addressed Speech Detection for Real-Time On-Device Voice AI

David Joohun Kim +3
eess.AS 2026-04-09 reviewed

TASU2 controls WER in CTC simulation for speech LLM adaptation
TASU2: Controllable CTC Simulation for Alignment and Low-Resource Adaptation of Speech LLMs

Jing Peng +7
eess.AS 2026-04-09 reviewed

Gaze cues select target speaker in multi-talker enhancement
Tracking Listener Attention: Gaze-Guided Audio-Visual Speech Enhancement Framework

Hsiang-Cheng Yang +5
eess.AS 2026-04-09 reviewed

Entropy metrics guide efficient LLM speech recognition
Rethinking Entropy Allocation in LLM-based ASR: Understanding the Dynamics between Speech Encoders and LLMs

Yuan Xie +6
cs.SD 2026-04-08 reviewed

Emotion recognition crosses languages with five source labels
Semantic-Emotional Resonance Embedding: A Semi-Supervised Paradigm for Cross-Lingual Speech Emotion Recognition

Ya Zhao +2
eess.AS 2026-04-08 reviewed

EvoTSE updates enrollment to cut confusion in speaker extraction
EvoTSE: Evolving Enrollment for Target Speaker Extraction

Zikai Liu +6
eess.AS 2026-04-08 reviewed

Attention module sharpens speech for cochlear implant users
DAT-CFTNet: Speech Enhancement for Cochlear Implant Recipients using Attention-based Dual-Path Recurrent Neural Network

Nursadul Mamun +1
eess.AS 2026-04-08 reviewed

Hierarchical loss lifts subtle fault detection in manufacturing
Deep Hierarchical Knowledge Loss for Fault Intensity Diagnosis

Yu Sha +10
eess.AS 2026-04-08 reviewed

One model learns both audio and speech traits via long-patch prediction
ULTRAS -- Unified Learning of Transformer Representations for Audio and Speech Signals

Ameenudeen P E +2
cs.CV 2026-04-07 reviewed

Residual CNN and BiGRU cut music score recognition error to 0.45%
A High-Accuracy Optical Music Recognition Method Based on Bottleneck Residual Convolutions

Junwen Ma +3
eess.AS 2026-04-07 reviewed

Voice dataset launches AI challenge for early ALS detection
SAND: The Challenge on Speech Analysis for Neurodegenerative Disease Assessment

Giovanna Sannino +12
eess.AS 2026-04-07 reviewed

Challenge dataset lets AI detect ALS from voice recordings
SAND: The Challenge on Speech Analysis for Neurodegenerative Disease Assessment

Giovanna Sannino +12
eess.AS 2026-04-07 reviewed

Model turns low-order reflections into full room impulse responses
Multimodal Deep Learning Method for Real-Time Spatial Room Impulse Response Computing

Zhiyu Li +3
eess.AS 2026-04-07 reviewed

Open-ear glasses cancel noise using only frame mics
Active noise cancellation on open-ear smart glasses

Kuang Yuan +7
eess.AS 2026-04-06 reviewed

Diarization models drop on child and older adult speech
Exploring Speech Foundation Models for Speaker Diarization Across Lifespan

Anfeng Xu +2
eess.AS 2026-04-06 reviewed

Joint training on all ages fixes diarization drops on child and older voices
Exploring Speech Foundation Models for Speaker Diarization Across Lifespan

Anfeng Xu +2
eess.AS 2026-04-06 reviewed

New benchmark tests voice agents on real disfluent speech and tool chains
Full-Duplex-Bench-v3: Benchmarking Tool Use for Full-Duplex Voice Agents Under Real-World Disfluency

Guan-Ting Lin +3
cs.SD 2026-04-06 reviewed

High-res audio plus subband experts beat 16 kHz detectors for singing fakes
Joint Fullband-Subband Modeling for High-Resolution SingFake Detection

Xuanjun Chen +5
cs.SD 2026-04-06 reviewed

Binaural attention lifts audio navigation success on unheard sounds
Generalizable Audio-Visual Navigation via Binaural Difference Attention and Action Transition Prediction

Jia Li +1
cs.AR 2026-04-06 reviewed

Bit partitioning lets one PE run FP8 or dual FP4 with 60% less area
DHFP-PE: Dual-Precision Hybrid Floating Point Processing Element for AI Acceleration

Shubham Kumar +3
eess.AS 2026-04-04 reviewed

Zero-shot KWS reaches 90% accuracy with 0.007% false alarms
MALEFA: Multi-grAnularity Learning and Effective False Alarm Suppression for Zero-shot Keyword Spotting

Lo-Ya Li +4
eess.AS 2026-04-03 reviewed

No enrollment needed: mixture yields usable speaker embeddings
Unmixing the Crowd: Learning Mixture-to-Set Speaker Embeddings for Enrollment-Free Target Speech Extraction

FNU Sidharth +3
eess.AS 2026-04-03 reviewed

Iterative reasoning lifts speaker attribution accuracy in group talks
Speaker-Reasoner: Scaling Interaction Turns and Reasoning Patterns for Timestamped Speaker-Attributed ASR

Zhennan Lin +7