archive
Every paper Pith has read.
623 papers in eess.AS · page 5
-
Speech synthesis hits 49 ms first-byte latency via block-wise decoding
An Ultra-Low Latency, End-to-End Streaming Speech Synthesis Architecture via Block-Wise Generation and Depth-Wise Codec Decoding
-
Common word cues cut rare bias word errors by 16% in speech LLMs
Contextual Biasing for ASR in Speech LLM with Common Word Cues and Bias Word Position Prediction
-
VoxEffects dataset supplies exact effect chains for speech audio
VoxEffects: A Speech-Oriented Audio Effects Dataset and Benchmark
-
Mamba predicts clean tokens to boost cochlear-implant speech in noise
TokenSE: a Mamba-based discrete token speech enhancement framework for cochlear implants
-
Pre-quantization fusion adds video to audio tokens without reconstruction loss
Why Your Tokenizer Fails in Information Fusion: A Timing-Aware Pre-Quantization Fusion for Video-Enhanced Audio Tokenization
-
Watermark survives normal edits but breaks on deepfakes
StreamMark: A Deep Learning-Based Semi-Fragile Audio Watermarking for Proactive Deepfake Detection
-
Audio AI models lose track of emotions in long talks
HumDial-EIBench: A Human-Recorded Multi-Turn Emotional Intelligence Benchmark for Audio Language Models
-
LLM with cluster tags beats sequential diarization plus ASR
Speaker Attributed Automatic Speech Recognition Using Speech Aware LLMS
-
Joint teacher-student updates cut speech WER by 4.6%
Teaching the Teachers: Boosting unsupervised domain adaptation in speech recognition by ensemble update
-
Neural estimator preserves direction in multichannel speech enhancement
Direction-Preserving MIMO Speech Enhancement Using a Neural Covariance Estimator
-
Deep learning ANC preserves speech while cutting non-stationary noise
Speech-preserving active noise control: a deep learning approach in reverberant environments
-
AF-Next outperforms similar open audio models on 20 benchmarks
Audio Flamingo Next: Next-Generation Open Audio-Language Models for Speech, Sound, and Music
-
Synthetic labels keep music-flavor structure intact
Multimodal Dataset Normalization and Perceptual Validation for Music-Taste Correspondences
-
Binary projection halves repetition in full-duplex speech models
ASPIRin: Action Space Projection for Interactivity-Optimized Reinforcement Learning in Full-Duplex Speech Language Models
-
Time-aware networks fix read bias in live speech translation
Regularized Entropy Information Adaptation with Temporal-Awareness Networks for Simultaneous Speech Translation
-
Self-control speech tasks sense student emotions
Toward using Speech to Sense Student Emotion in Remote Learning Environments
-
Utterance filters pick reliable child ASR outputs at 97% precision
Utterance-Level Methods for Identifying Reliable ASR-Output for Child Speech
-
Diverse broadcast audio pretraining boosts SSL models
Data Selection Effects on Self-Supervised Learning of Audio Representations for French Audiovisual Broadcasts
-
Language model separates music stems via discrete tokens
Discrete Token Modeling for Multi-Stem Music Source Separation with Language Models
-
Model turns mixed dialogue audio into separate speaker tracks
DialogueSidon: Recovering Full-Duplex Dialogue Tracks from In-the-Wild Dialogue Audio
-
Phoneme sequences outperform projectors in low-resource LLM ASR
Phonemes vs. Projectors: An Investigation of Speech-Language Interfaces for LLM-based ASR
-
Confidence weighting cuts medical ASR errors for Telugu and Kannada
Enhancing ASR Performance in the Medical Domain for Dravidian Languages
-
Phonetic sync aligns dubbed audio to original lips
PS-TTS: Phonetic Synchronization in Text-to-Speech for Achieving Natural Automated Dubbing
-
ASR models output wrong scripts in 21% of multilingual cases
Script collapse in multilingual ASR: A reference-free metric and 100-pair benchmark
-
Audio prompts plus online RL lift conversational TTS quality
Enhancing Conversational TTS with Cascaded Prompting and ICL-Based Online Reinforcement Learning
-
Front-end choice dominates deepfake audio detector performance
DeepFense: A Unified, Modular, and Extensible Framework for Robust Deepfake Audio Detection
-
Ring mixing halves residual noise in unsupervised speech separation
Ring Mixing with Auxiliary Signal-to-Consistency-Error Ratio Loss for Unsupervised Denoising in Speech Separation
-
Interaction history lifts device-addressed speech detection F1 to 0.95
Selective Attention System (SAS): Device-Addressed Speech Detection for Real-Time On-Device Voice AI
-
TASU2 controls WER in CTC simulation for speech LLM adaptation
TASU2: Controllable CTC Simulation for Alignment and Low-Resource Adaptation of Speech LLMs
-
Gaze cues select target speaker in multi-talker enhancement
Tracking Listener Attention: Gaze-Guided Audio-Visual Speech Enhancement Framework
-
Entropy metrics guide efficient LLM speech recognition
Rethinking Entropy Allocation in LLM-based ASR: Understanding the Dynamics between Speech Encoders and LLMs
-
Emotion recognition crosses languages with five source labels
Semantic-Emotional Resonance Embedding: A Semi-Supervised Paradigm for Cross-Lingual Speech Emotion Recognition
-
EvoTSE updates enrollment to cut confusion in speaker extraction
EvoTSE: Evolving Enrollment for Target Speaker Extraction
-
Attention module sharpens speech for cochlear implant users
DAT-CFTNet: Speech Enhancement for Cochlear Implant Recipients using Attention-based Dual-Path Recurrent Neural Network
-
Hierarchical loss lifts subtle fault detection in manufacturing
Deep Hierarchical Knowledge Loss for Fault Intensity Diagnosis
-
One model learns both audio and speech traits via long-patch prediction
ULTRAS -- Unified Learning of Transformer Representations for Audio and Speech Signals
-
Residual CNN and BiGRU cut music score recognition error to 0.45%
A High-Accuracy Optical Music Recognition Method Based on Bottleneck Residual Convolutions
-
Voice dataset launches AI challenge for early ALS detection
SAND: The Challenge on Speech Analysis for Neurodegenerative Disease Assessment
-
Challenge dataset lets AI detect ALS from voice recordings
SAND: The Challenge on Speech Analysis for Neurodegenerative Disease Assessment
-
Model turns low-order reflections into full room impulse responses
Multimodal Deep Learning Method for Real-Time Spatial Room Impulse Response Computing
-
Open-ear glasses cancel noise using only frame mics
Active noise cancellation on open-ear smart glasses
-
Diarization models drop on child and older adult speech
Exploring Speech Foundation Models for Speaker Diarization Across Lifespan
-
Joint training on all ages fixes diarization drops on child and older voices
Exploring Speech Foundation Models for Speaker Diarization Across Lifespan
-
New benchmark tests voice agents on real disfluent speech and tool chains
Full-Duplex-Bench-v3: Benchmarking Tool Use for Full-Duplex Voice Agents Under Real-World Disfluency
-
High-res audio plus subband experts beat 16 kHz detectors for singing fakes
Joint Fullband-Subband Modeling for High-Resolution SingFake Detection
-
Binaural attention lifts audio navigation success on unheard sounds
Generalizable Audio-Visual Navigation via Binaural Difference Attention and Action Transition Prediction
-
Bit partitioning lets one PE run FP8 or dual FP4 with 60% less area
DHFP-PE: Dual-Precision Hybrid Floating Point Processing Element for AI Acceleration
-
Zero-shot KWS reaches 90% accuracy with 0.007% false alarms
MALEFA: Multi-grAnularity Learning and Effective False Alarm Suppression for Zero-shot Keyword Spotting
-
No enrollment needed: mixture yields usable speaker embeddings
Unmixing the Crowd: Learning Mixture-to-Set Speaker Embeddings for Enrollment-Free Target Speech Extraction
-
Iterative reasoning lifts speaker attribution accuracy in group talks
Speaker-Reasoner: Scaling Interaction Turns and Reasoning Patterns for Timestamped Speaker-Attributed ASR