archive
Every paper Pith has read.
623 papers in eess.AS · page 8
-
Discrete tokens close the ASR-TTS loop and cut error rates
TokenChain: A Discrete Speech Chain via Semantic Token Modeling
-
Pruned Whisper keeps 90 percent accuracy at 48 percent smaller size
BaldWhisper: Faster Whisper with Head Shearing and Layer Merging
-
Source embedding scales music structure analysis to messy labels
SongFormer: Scaling Music Structure Analysis with Heterogeneous Supervision
-
Matrix method localizes multiple sound sources with one variable per source
Multi-Source Position and Direction-of-Arrival Estimation Based on Euclidean Distance Matrices
-
System changes live audio effects from performer emotions
Go witheFlow: Real-time Emotion Driven Audio Effects Modulation
-
Twin DiT modules generate synced audio and video in one pass
Ovi: Twin Backbone Cross-Modal Fusion for Audio-Video Generation
-
Spoken AI models falter on timing and sync tasks
Game-Time: Evaluating Temporal Dynamics in Spoken Language Models
-
Semantic tokens guide flow matching to enhance speech faithfully
SenSE: Semantic-Aware High-Fidelity Universal Speech Enhancement
-
Zero-shot cosine outperforms few-shot in OOD deepfake tracing
Advancing Zero-Shot Open-Set Speech Deepfake Source Tracing
-
TV dialogue dataset raises voice role-play scores 38 percent
AudioRole: An Audio Dataset for Character Role-Playing in Large Language Models
-
VLM generates music from images with no training or fine-tuning
Zero-Effort Image-to-Music Generation: An Interpretable RAG-based VLM Approach
-
Models bias continuations toward modal voice more for female prompts
Speak Your Mind: The Speech Continuation Task as a Probe of Voice-Based Model Bias
-
Adversarial noise tricks speech enhancers into new meanings
Are Modern Speech Enhancement Systems Vulnerable to Adversarial Attacks?
-
Source count fusion lifts DOA F1-scores by 14% in hearing aids
Multi-Speaker DOA Estimation in Binaural Hearing Aids using Deep Learning and Speaker Count Fusion
-
Model turns video into object-aware stereo sound
StereoFoley: Object-Aware Stereo Audio Generation from Video
-
One model hits SOTA on text, image, audio and video
Qwen3-Omni Technical Report
-
Comparator loss yields speech severity scores for disease tracking
Comparator Loss: An Ordinal Contrastive Loss to Derive a Severity Score for Speech-based Health Monitoring
-
Single model separates and locates multiple overlapping sounds
DeepASA: An Object-Oriented Multi-Purpose Network for Auditory Scene Analysis
-
Visual cues first cluster speech features in AVSR models
Interpreting the Role of Visemes in Audio-Visual Speech Recognition
-
Differentiable method boosts acoustic modeling from limited data
Differentiable Acoustic Radiance Transfer
-
Speaker-aware model yields more realistic conversation timings
From Independence to Interaction: Speaker-Aware Simulation of Multi-Speaker Conversational Timing
-
1% data addition activates simultaneous translation in audio models
Direct Simultaneous Translation Activation for Large Audio-Language Models
-
Thai encoder and data pipeline enable multitask speech AI
Towards Building Speech Large Language Models for Multitask Understanding in Low-Resource Languages
-
TTS systems default to adult voices despite child or elderly instructions
Do You Hear What I Mean? Quantifying the Instruction-Perception Gap in Instruction-Guided Expressive Text-To-Speech Systems
-
Mixture of experts blends binaural filters for moving talkers
Mixture-of-Experts Framework for Field-of-View Enhanced Signal-Dependent Binauralization of Moving Talkers
-
Expert routing boosts noisy emotion recognition 12 percent
Joint Learning using Mixture-of-Expert-Based Representation for Speech Enhancement and Robust Emotion Recognition
-
Audio LLM toolkit delivers up to 151 percent faster evaluations
AU-Harness: An Open-Source Toolkit for Holistic Evaluation of Audio LLMs
-
Diffusion model adds personalized events from few reference audios
DreamAudio: Customized Text-to-Audio Generation with Diffusion Models
-
Audiobook quotes with verb labels boost TTS expressivity
Computational Narrative Understanding for Expressive Text-to-Speech
-
Shared encoder beats separate models on diarization and separation
Unifying Diarization, Separation, and ASR with Multi-Speaker Encoder
-
Self-generated data aligns speech LLMs to instructions better
Enhancing Speech Large Language Models through Reinforced Behavior Alignment
-
Adapted open LLMs match commercial dementia screening tools
Speech-Based Cognitive Screening: A Systematic Evaluation of LLM Adaptation Strategies
-
Physics-aware kernels recover steering vectors from 10x fewer measurements
Gaussian Process Regression of Steering Vectors With Physics-Aware Deep Composite Kernels for Augmented Listening
-
Fine-tuned causal Whisper outperforms streaming baselines under 300ms
WhisperRT -- Turning Whisper into a Causal Streaming Model
-
ASR fixes grouped into five classes with shared metrics
Non-Intrusive Automatic Speech Recognition Refinement: A Survey
-
One model restores and masters music from text instructions
SonicMaster: Towards Controllable All-in-One Music Restoration and Mastering
-
Benchmark with multi-expert captions tests detailed audio understanding
MECAT: A Multi-Experts Constructed Benchmark for Fine-Grained Audio Understanding Tasks
-
New benchmark finds two strategies in full-duplex speech models
Full-Duplex-Bench v1.5: Evaluating Overlap Handling for Full-Duplex Speech Models
-
Step-Audio 2 sets SOTA on audio understanding and conversation
Step-Audio 2 Technical Report
-
Chaos discriminators set new SOTA for speech bandwidth extension
CIS-BWE: Chaos-Informed Speech Bandwidth Extension
-
Pipeline adds stress and punctuation to Russian speech data
Balalaika: Data-Centric, Prosody-Aware Annotation Pipeline for Russian Speech
-
Text transcriptions boost attacks on voice anonymization
VoxATtack: A Multimodal Attack on Voice Anonymization Systems
-
AI model builds 3D ocean sound maps from surface data alone
A Multimodal Data Fusion Attention-Empowered Generative Adversarial Network for Real Time 3D Underwater Sound Speed Field Construction
-
Flow-matching model generates spoken dialogues faster
ZipVoice-Dialog: Non-Autoregressive Spoken Dialogue Generation with Flow Matching
-
Open audio model tops 20+ benchmarks with public data only
Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio Language Models
-
Coupled quadratic program cuts distortion in multichannel mixers
Constraint Optimized Multichannel Mixer-limiter Design
-
One-step model beats 30-step teacher in speech enhancement
Robust One-step Speech Enhancement via Consistency Distillation
-
Unified model generates speech and facial motion together
JAM-Flow: Joint Audio-Motion Synthesis with Flow Matching
-
Scattered sound classifies hair type and moisture at nearly 90 percent
Acoustic scattering AI for non-invasive object classifications: A case study on hair assessment
-
Calibrated distillation lifts low-complexity speech enhancement
Leveraging Local and Global Knowledge Integration with Time-Frequency Calibrated Distillation for Speech Enhancement