archive

Every paper Pith has read. Search by title, abstract, or pith.

623 papers in eess.AS · page 9

eess.AS 2025-06-11 reviewed

Text alone identifies speakers at 2% error in privacy tests
You Are What You Say: Exploiting Linguistic Content for VoicePrivacy Attacks

Ünal Ege Gaznepoglu +6
cs.SD 2025-06-08 reviewed

AI bass model produces polyphony inside single harmonic tones
Insights on Harmonic Tones from a Generative Music Experiment

Emmanuel Deruty +1
cs.CL 2025-06-05 reviewed

Benchmark finds SpeechLLMs weak on speech nuances beyond text
MMSU: A Massive Multi-task Spoken Language Understanding and Reasoning Benchmark

Dingdong Wang +6
eess.AS 2025-06-02 reviewed

Ensemble method adds confidence intervals to speech boundaries
Gradient boundaries through confidence intervals for forced alignment estimates using model ensembles

Matthew C. Kelley
cs.CL 2025-06-01 reviewed

LLM pipeline creates sarcastic speech dataset with 73.63% F1
Leveraging Large Language Models for Sarcastic Speech Annotation in Sarcasm Detection

Zhu Li +4
eess.AS 2025-05-31 reviewed

Two-stage transfer learning predicts P.835 scores from 100 labels
Quality Assessment of Noisy and Enhanced Speech with Limited Data: UWB-NTIS System for VoiceMOS 2024

Marie Kunešová +2
cs.SD 2025-05-30 reviewed

Neural codec reaches 2.87 PESQ at 2.67 kbps
SwitchCodec: A High-Fidelity Neural Audio Codec With Sparse Quantization

Jin Wang +4
cs.CL 2025-05-28 reviewed

Speech LMs miss meaning shifts from sentence stress
StressTest: Can YOUR Speech LM Handle the Stress?

Iddo Yosha +2
cs.SD 2025-05-28 reviewed

Fixed decoder raises audio steganography quality by over 10 dB
FGAS: Fixed Decoder Network-Based Audio Steganography with Adversarial Perturbation Generation

Jialin Yan +6
cs.SD 2025-05-27 reviewed

Tailored designs succeed on music AVQA where general models struggle
Music Audio-Visual Question Answering Requires Specialized Multimodal Designs

Wenhao You +11
cs.SD 2025-05-23 reviewed

CosyVoice 3 scales speech data to one million hours for stronger zero-shot results
CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training

Zhihao Du +21
eess.AS 2025-05-21 reviewed

Taxonomy sorts LALM benchmarks into four objective-based dimensions
Towards Holistic Evaluation of Large Audio-Language Models: A Comprehensive Survey

Chih-Kai Yang +2
cs.SD 2025-05-20 reviewed

FMSD-TTS creates Ü-Tsang, Amdo, and Kham speech from few clips
FMSD-TTS: Few-shot Multi-Speaker Multi-Dialect Text-to-Speech Synthesis for Ü-Tsang, Amdo and Kham Speech Dataset Generation

Yutong Liu +9
eess.AS 2025-05-20 reviewed

Framework edits speech amid overlapping background noise
SeamlessEdit: Background Noise Aware Zero-Shot Speech Editing with in-Context Enhancement

Kuan-Yu Chen +3
cs.SD 2025-05-19 reviewed

One model translates among score images, symbolic music, and performance audio
Unified Cross-modal Translation of Score Images, Symbolic Music, and Performance Audio

Jongmin Jung +7
cs.SD 2025-05-15 reviewed

Random linear map turns audio embeddings into dynamic visuals
LAV: Audio-Driven Dynamic Visual Generation with Neural Compression and StyleGAN2

Jongmin Jung +1
math.CO 2025-05-13 reviewed

Tonnetz realized as twelve points and twelve lines in the plane
Configurations, Tessellations and Tone Networks

Jeffrey R. Boland +1
cs.SD 2025-05-13 reviewed

Drum grooves edited zero-shot by plain LLMs via spatial text grid
Not that Groove: Zero-Shot Symbolic Music Editing

Li Zhang
eess.AS 2025-05-03 reviewed

Device info at inference lifts scene classification baseline
Low-Complexity Acoustic Scene Classification with Device Information in the DCASE 2025 Challenge

Florian Schmid +5
eess.AS 2025-05-01 reviewed

Anonymized speech preserves clinical ratings but lowers perceived quality
Perceptual implications of automatic anonymization in pathological speech

Soroosh Tayebi Arasteh +13
eess.AS 2025-04-25 reviewed

Open audio model hits state-of-the-art on speech and conversation benchmarks
Kimi-Audio Technical Report

KimiTeam +39
eess.AS 2025-04-24 reviewed

One speaker creates multiple sound zones with multi-frequency ultrasound
Generating Localized Audible Zones Using a Single-Channel Parametric Loudspeaker

Tao Zhuang +4
cs.SD 2025-04-17 reviewed

Multi-task attention CNN hits 97% accuracy on scarce underwater sounds
A Multi-task Learning Balanced Attention Convolutional Neural Network Model for Few-shot Underwater Acoustic Target Recognition

Wei Huang +5
eess.AS 2025-04-16 reviewed

Augmentation lifts deepfake detection accuracy under codecs and loss
Benchmarking Audio Deepfake Detection Robustness in Real-world Communication Scenarios

Haohan Shi +5
eess.AS 2025-04-11 reviewed

Reverberation features lift distance accuracy in 3D sound detection
Reverberation-based Features for Sound Event Localization and Detection with Distance Estimation

Davide Berghi +1
cs.CL 2025-04-11 reviewed

Survey groups spoken language models by architecture
On The Landscape of Spoken Language Models: A Comprehensive Survey

Siddhant Arora +9
eess.AS 2025-04-01 reviewed

Cyclic sound patterns engineered to trigger ASMR
Is ASMR Engineerable? A Signal Processing and User Experience Study

Zexin Fang +4
cs.CL 2025-03-30 reviewed

Hybrid model lifts end-turn accuracy at low compute cost
Speculative End-Turn Detector for Efficient Speech Chatbot Assistant

Hyunjong Ok +2
cs.AR 2025-03-27 reviewed

71.2 μW accelerator runs real-time speech recognition
A 71.2-μW Speech Recognition Accelerator with Recurrent Spiking Neural Network

Chih-Chyau Yang +1
cs.CL 2025-03-26 reviewed

Qwen2.5-Omni matches text performance on speech tasks
Qwen2.5-Omni Technical Report

Jin Xu +13
cs.MM 2025-03-13 reviewed

One model turns text, video or audio prompts into sound
AudioX: A Unified Framework for Anything-to-Audio Generation

Zeyue Tian +8
cs.CL 2025-03-07 reviewed

Benchmark shows industrial S2S models outperform academic ones on tone and emotion
S2S-Arena: Evaluating Paralinguistic Instruction Following in Speech-to-Speech Models

Feng Jiang +8
cs.SD 2025-03-03 reviewed

Single-stream codec splits speech for LLM voice control
Spark-TTS: An Efficient LLM-Based Text-to-Speech Model with Single-Stream Decoupled Speech Tokens

Xinsheng Wang +24
cs.CR 2025-02-27 reviewed

Simple audio tweaks fool all tested deepfake detectors
DeePen: Penetration Testing for Audio Deepfake Detection

Nicolas Müller +7
cs.GR 2025-02-25 reviewed

Text prompts steer 3D dance generation to match music genres
GCDance: Genre-Controlled Music-Driven 3D Full Body Dance Generation

Xinran Liu +5
cs.CL 2025-02-17 reviewed

130B model unifies speech and text for real-time interaction
Step-Audio: Unified Understanding and Generation in Intelligent Speech Interaction

Ailin Huang +144
cs.SD 2025-02-17 reviewed

Paired throat-acoustic dataset trains models to restore lost speech frequencies
Throat and acoustic paired speech dataset for deep learning-based speech enhancement

Yunsik Kim +2
eess.AS 2025-02-09 reviewed

Silent EMG signals map directly to phonemic text
Non-invasive electromyographic speech neuroprosthesis: a geometric perspective

Harshavardhana T. Gowda +1
cs.SD 2025-02-07 reviewed

Four-axis guidelines automate audio quality scoring
Meta Audiobox Aesthetics: Unified Automatic Quality Assessment for Speech, Music, and Sound

Andros Tjandra +12
cs.SD 2025-02-06 reviewed

Cross-attention audio watermark survives generative edits
XAttnMark: Learning Robust Audio Watermarking with Cross-Attention

Yixin Liu +4
eess.AS 2025-02-04 reviewed

Full recordings classify dementia without trimming speech
Dementia classification from spontaneous speech using wrapper-based feature selection

Marko Niemelä +3
eess.AS 2025-01-30 reviewed

Multilayer unit gives one reflector fine phase and amplitude control
ML-ARIS: Multilayer Underwater Acoustic Reconfigurable Intelligent Surface with High-Resolution Reflection Control

Lina Pu +2
cs.SD 2025-01-13 reviewed

Classical music networks show centuries of simplification
Decoding Musical Evolution Through Network Science

Niccolò Di Marco +4
cs.CV 2025-01-03 reviewed

Staged training adds speech understanding to vision models
VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction

Chaoyou Fu +15
cs.LG 2024-12-20 reviewed

Finetuning lets text-to-audio models respect event relations
RiTTA: Modeling Event Relations in Text-to-Audio Generation

Yuhang He +4
cs.SD 2024-12-18 reviewed

ResNet18 leads detection of machine-generated music
Explainable Detection of Machine Generated Music and Early Systematic Evaluation

Yupei Li +4
cs.LG 2024-12-17 reviewed

MoInCL reduces forgetting when MLLMs switch modalities and task types
Modality-Inconsistent Continual Learning of Multimodal Large Language Models

Weiguo Pian +5
cs.SD 2024-12-13 reviewed

CosyVoice 2 hits human parity in streaming speech synthesis
CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models

Zhihao Du +18
cs.SD 2024-12-09 reviewed

Diffusion refiner boosts any music source separator
Improving Music Source Separation with Diffusion and Consistency Refinement

Tornike Karchkhadze +3
cs.CL 2024-12-03 reviewed

End-to-end voice model hits SOTA on spoken QA and modeling
GLM-4-Voice: Towards Intelligent and Human-Like End-to-End Spoken Chatbot

Aohan Zeng +7