archive
Every paper Pith has read. Search by title, abstract, or pith.
623 papers in eess.AS · page 9
-
Text alone identifies speakers at 2% error in privacy tests
You Are What You Say: Exploiting Linguistic Content for VoicePrivacy Attacks
-
AI bass model produces polyphony inside single harmonic tones
Insights on Harmonic Tones from a Generative Music Experiment
-
Benchmark finds SpeechLLMs weak on speech nuances beyond text
MMSU: A Massive Multi-task Spoken Language Understanding and Reasoning Benchmark
-
Ensemble method adds confidence intervals to speech boundaries
Gradient boundaries through confidence intervals for forced alignment estimates using model ensembles
-
LLM pipeline creates sarcastic speech dataset with 73.63% F1
Leveraging Large Language Models for Sarcastic Speech Annotation in Sarcasm Detection
-
Two-stage transfer learning predicts P.835 scores from 100 labels
Quality Assessment of Noisy and Enhanced Speech with Limited Data: UWB-NTIS System for VoiceMOS 2024
-
Neural codec reaches 2.87 PESQ at 2.67 kbps
SwitchCodec: A High-Fidelity Neural Audio Codec With Sparse Quantization
-
Speech LMs miss meaning shifts from sentence stress
StressTest: Can YOUR Speech LM Handle the Stress?
-
Fixed decoder raises audio steganography quality by over 10 dB
FGAS: Fixed Decoder Network-Based Audio Steganography with Adversarial Perturbation Generation
-
Tailored designs succeed on music AVQA where general models struggle
Music Audio-Visual Question Answering Requires Specialized Multimodal Designs
-
CosyVoice 3 scales speech data to one million hours for stronger zero-shot results
CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training
-
Taxonomy sorts LALM benchmarks into four objective-based dimensions
Towards Holistic Evaluation of Large Audio-Language Models: A Comprehensive Survey
-
FMSD-TTS creates Ü-Tsang, Amdo and Kham speech from few clips
FMSD-TTS: Few-shot Multi-Speaker Multi-Dialect Text-to-Speech Synthesis for Ü-Tsang, Amdo and Kham Speech Dataset Generation
-
Framework edits speech amid overlapping background noise
SeamlessEdit: Background Noise Aware Zero-Shot Speech Editing with in-Context Enhancement
-
One model translates among score images, symbolic music, and performance audio
Unified Cross-modal Translation of Score Images, Symbolic Music, and Performance Audio
-
Random linear map turns audio embeddings into dynamic visuals
LAV: Audio-Driven Dynamic Visual Generation with Neural Compression and StyleGAN2
-
Tonnetz realized as twelve points and twelve lines in the plane
Configurations, Tessellations and Tone Networks
-
Drum grooves edited zero-shot by plain LLMs via spatial text grid
Not that Groove: Zero-Shot Symbolic Music Editing
-
Device info at inference lifts scene classification baseline
Low-Complexity Acoustic Scene Classification with Device Information in the DCASE 2025 Challenge
-
Anonymized speech preserves clinical ratings but lowers perceived quality
Perceptual implications of automatic anonymization in pathological speech
-
Open audio model hits state-of-the-art on speech and conversation benchmarks
Kimi-Audio Technical Report
-
One speaker creates multiple sound zones with multi-frequency ultrasound
Generating Localized Audible Zones Using a Single-Channel Parametric Loudspeaker
-
Multi-task attention CNN hits 97% accuracy on scarce underwater sounds
A Multi-task Learning Balanced Attention Convolutional Neural Network Model for Few-shot Underwater Acoustic Target Recognition
-
Augmentation lifts deepfake detection accuracy under codecs and loss
Benchmarking Audio Deepfake Detection Robustness in Real-world Communication Scenarios
-
Reverberation features lift distance accuracy in 3D sound detection
Reverberation-based Features for Sound Event Localization and Detection with Distance Estimation
-
Survey groups spoken language models by architecture
On The Landscape of Spoken Language Models: A Comprehensive Survey
-
Cyclic sound patterns engineered to trigger ASMR
Is ASMR Engineerable? A Signal Processing and User Experience Study
-
Hybrid model lifts end-turn accuracy at low compute cost
Speculative End-Turn Detector for Efficient Speech Chatbot Assistant
-
71.2 μW accelerator runs real-time speech recognition
A 71.2-μW Speech Recognition Accelerator with Recurrent Spiking Neural Network
-
Qwen2.5-Omni matches text performance on speech tasks
Qwen2.5-Omni Technical Report
-
One model turns text, video or audio prompts into sound
AudioX: A Unified Framework for Anything-to-Audio Generation
-
Benchmark shows industrial S2S models outperform academic ones on tone and emotion
S2S-Arena: Evaluating Paralinguistic Instruction Following in Speech-to-Speech Models
-
Single-stream codec splits speech for LLM voice control
Spark-TTS: An Efficient LLM-Based Text-to-Speech Model with Single-Stream Decoupled Speech Tokens
-
Simple audio tweaks fool all tested deepfake detectors
DeePen: Penetration Testing for Audio Deepfake Detection
-
Text prompts steer 3D dance generation to match music genres
GCDance: Genre-Controlled Music-Driven 3D Full Body Dance Generation
-
130B model unifies speech and text for real-time interaction
Step-Audio: Unified Understanding and Generation in Intelligent Speech Interaction
-
Paired throat-acoustic dataset trains models to restore lost speech frequencies
Throat and acoustic paired speech dataset for deep learning-based speech enhancement
-
Silent EMG signals map directly to phonemic text
Non-invasive electromyographic speech neuroprosthesis: a geometric perspective
-
Four-axis guidelines automate audio quality scoring
Meta Audiobox Aesthetics: Unified Automatic Quality Assessment for Speech, Music, and Sound
-
Cross-attention audio watermark survives generative edits
XAttnMark: Learning Robust Audio Watermarking with Cross-Attention
-
Full recordings classify dementia without trimming speech
Dementia classification from spontaneous speech using wrapper-based feature selection
-
Multilayer unit gives one reflector fine phase and amplitude control
ML-ARIS: Multilayer Underwater Acoustic Reconfigurable Intelligent Surface with High-Resolution Reflection Control
-
Classical music networks show centuries of simplification
Decoding Musical Evolution Through Network Science
-
Staged training adds speech understanding to vision models
VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction
-
Finetuning lets text-to-audio models respect event relations
RiTTA: Modeling Event Relations in Text-to-Audio Generation
-
ResNet18 leads detection of machine-generated music
Explainable Detection of Machine Generated Music and Early Systematic Evaluation
-
MoInCL reduces forgetting when MLLMs switch modalities and task types
Modality-Inconsistent Continual Learning of Multimodal Large Language Models
-
CosyVoice 2 hits human parity in streaming speech synthesis
CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models
-
Diffusion refiner boosts any music source separator
Improving Music Source Separation with Diffusion and Consistency Refinement
-
End-to-end voice model hits SOTA on spoken QA and modeling
GLM-4-Voice: Towards Intelligent and Human-Like End-to-End Spoken Chatbot