archive

Every paper Pith has read. Search by title, abstract, or pith.

623 papers in eess.AS · page 10

cs.MM 2024-11-26 reviewed

Both global and shared position IDs align video text and speech
Mechanisms of Multimodal Synchronization: Insights from Decoder-Based Video-Text-to-Speech Synthesis

Akshita Gupta +5
cs.SD 2024-11-24 reviewed

Image diffusion models transfer music styles without training
Repurposing Image Diffusion Models for Training-Free Music Style Transfer on Mel-spectrograms

Heehwan Wang +6
cs.SD 2024-11-19 reviewed

DGSNA generates dynamic scene-based noise via prompts and diffusion models to augment…
DGSNA: Dynamic Generative Scene-based Noise Addition method

Zihao Chen +4
cs.SD 2024-11-06 reviewed

Pooling speech datasets improves quality model generalization
MOS-Bench: Benchmarking Generalization Abilities of Subjective Speech Quality Assessment Models

Wen-Chin Huang +2
cs.SD 2024-11-05 reviewed

Slide text cues extract target speaker from mixed audio
pTSE-T: Presentation Target Speaker Extraction using Unaligned Text Cues

Ziyang Jiang +6
cs.CL 2024-10-25 reviewed

GPT-4o responds to audio inputs in as little as 232 milliseconds
GPT-4o System Card

OpenAI: Aaron Hurst +415
eess.AS 2024-10-24 reviewed

Top audio models score only 53 percent on expert reasoning benchmark
MMAU: A Massive Multi-Task Audio Understanding and Reasoning Benchmark

S Sakshi +8
cs.SD 2024-10-23 reviewed

Equivariant transformer beats prototype on chord accompaniment
Music102: An $D_{12}$-equivariant transformer for chord progression accompaniment

Weiliang Luo
cs.CL 2024-10-22 reviewed

VoiceBench tests LLM voice assistants on varied real-world speech
VoiceBench: Benchmarking LLM-Based Voice Assistants

Yiming Chen +5
eess.AS 2024-10-09 reviewed

Text padding plus ConvNeXt yields 0.15 RTF zero-shot TTS
F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching

Yushen Chen +7
eess.AS 2024-10-04 reviewed

Dataset supplies first mixed cardiopulmonary sounds from manikin
Manikin-Recorded Cardiopulmonary Sounds Dataset Using Digital Stethoscope

Yasaman Torabi +2
cs.SD 2024-09-27 reviewed

Two-stage method improves emotion and speaker match in zero-shot TTS
Expressive Prompting: Improving Emotion Intensity and Speaker Consistency in Zero-Shot TTS

Haoyu Wang +8
eess.AS 2024-09-17 reviewed

Moshi delivers real-time full-duplex speech at 160 ms theoretical latency
Moshi: a speech-text foundation model for real-time dialogue

Alexandre Défossez +7
cs.SD 2024-08-30 reviewed

KAN-enhanced AASIST more than halves deepfake detection error
AASIST3: KAN-Enhanced AASIST Speech Deepfake Detection using SSL Features and Additional Regularization for the ASVspoof 2024 Challenge

Kirill Borodin +6
cs.SD 2024-08-14 reviewed

Tuned MFCC parameters lift respiratory detection accuracy by up to 19.6%
Optimising MFCC parameters for the automatic detection of respiratory diseases

Yuyang Yan +5
eess.AS 2024-07-15 reviewed

Audio model outperforms Gemini on voice instruction tasks
Qwen2-Audio Technical Report

Yunfei Chu +11
cs.SD 2024-07-07 reviewed

Supervised tokens improve zero-shot TTS cloning
CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens

Zhihao Du +11
cs.SD 2024-06-20 reviewed

Discrete tokens lag continuous features on audio tasks
DASB - Discrete Audio and Speech Benchmark

Pooneh Mousavi +7
eess.AS 2024-06-04 reviewed

TTS model matches human speech in similarity and naturalness
Seed-TTS: A Family of High-Quality Versatile Speech Generation Models

Philip Anastassiou +45
cs.SD 2024-06-03 reviewed

Self-supervised transformer learns rare animal calls from unlabeled audio
animal2vec and MeerKAT: A self-supervised transformer for rare-event raw audio input and a large-scale reference dataset for bioacoustics

Julian C. Schäfer-Zimmermann +11
eess.SP 2024-05-10 reviewed

Lightweight net detects heart murmurs on phones with 80% accuracy
FunnelNet: An End-to-End Deep Learning Framework to Monitor Digital Heart Murmur in Real-Time

Md Jobayer +6
cs.SD 2024-03-30 reviewed

5-second clips classify pediatric heart sounds at 93.69% accuracy
Classification of Short Segment Pediatric Heart Sounds Based on a Transformer-Based Convolutional Neural Network

Md Hassanuzzaman +5
cs.HC 2024-03-25 reviewed

Humans detect AI media at coin-toss accuracy
As Good As A Coin Toss: Human detection of AI-generated images, videos, audio, and audiovisual stimuli

Di Cooke +3
cs.SD 2024-02-12 reviewed

HuBERT detects COVID-19 from voice at 86% accuracy
Developing a Multi-variate Prediction Model For COVID-19 From Crowd-sourced Respiratory Voice Data

Yuyang Yan +3
cs.CY 2024-01-24 reviewed

Community input required for AI reviewing police stops
Community-Informed AI Models for Police Accountability

Benjamin A.T. Graham +15
cs.SD 2024-01-17 reviewed

Multi-language dataset of 175 TTS voices boosts deepfake detector training
MLAAD: The Multi-Language Audio Anti-Spoofing Dataset

Nicolas M. Müller +8
cs.SD 2023-12-28 reviewed

Knowledge transfer reconstructs missing audio to improve sentiment analysis
Multimodal Sentiment Analysis with Missing Modality: A Knowledge-Transfer Approach

Weide Liu +1
eess.AS 2023-11-14 reviewed

One audio model covers 30+ tasks without fine-tuning
Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models

Yunfei Chu +7
cs.SD 2023-10-20 reviewed

Model lets LLMs hear speech, sounds and music directly
SALMONN: Towards Generic Hearing Abilities for Large Language Models

Changli Tang +8
cs.SD 2023-09-22 reviewed

Deepfake audio augments speech-to-text training data
Deepfake audio as a data augmentation technique for training automatic speech to text transcription models

Alexandre R. Ferreira +1
cs.CL 2023-06-22 reviewed

Fused text-speech model beats prior translation systems
AudioPaLM: A Large Language Model That Can Speak and Listen

Paul K. Rubenstein +29
cs.CL 2023-06-05 reviewed

Video-LLaMA adds Q-formers so LLMs grasp video sights and sounds
Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding

Hang Zhang +2
cs.CL 2023-05-02 reviewed

Nets trained on single words start concatenating them
Basic syntax from speech: Spontaneous concatenation in unsupervised deep neural networks

Gašper Beguš +2
cs.SD 2023-01-26 reviewed

MusicLM turns text into minutes of consistent 24 kHz music
MusicLM: Generating Music From Text

Andrea Agostinelli +12
cs.CL 2023-01-05 reviewed

Discrete audio code model enables zero-shot TTS from 3s prompt
Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers

Chengyi Wang +12
eess.AS 2022-12-06 reviewed

Scale to 680k hours enables zero-shot speech recognition
Robust Speech Recognition via Large-Scale Weak Supervision

Alec Radford +5
cs.SD 2022-11-22 reviewed

Two-stage filter cleans noisy labels for speaker verification
Robust Training for Speaker Verification against Noisy Labels

Zhihua Fang +4
eess.AS 2022-10-24 reviewed

Neural codec beats baselines at real-time high-fidelity audio compression
High Fidelity Neural Audio Compression

Alexandre Défossez +3
cs.SD 2021-10-12 reviewed

Augmented ConvNet classifies COVID coughs at 87 percent AUC
COVID-19 Diagnosis from Cough Acoustics using ConvNets and Data Augmentation

Saranga Kingkor Mahanta +4
cs.LG 2021-07-30 reviewed

One architecture handles any input and any output structure at linear cost
Perceiver IO: A General Architecture for Structured Inputs & Outputs

Andrew Jaegle +14
eess.AS 2021-06-29 reviewed

Five fixed channels unify monaural and binaural auditory model
Towards a generalized monaural and binaural auditory model for psychoacoustics and speech intelligibility

Thomas Biberger +1
cs.CL 2021-05-26 reviewed

Multitask model lowers Anglicism errors in German ASR by 3%
Multitask Learning for Grapheme-to-Phoneme Conversion of Anglicisms in German Speech Recognition

Julia Pritzen +3
cs.SD 2021-05-03 reviewed

Neural net reaches SOTA on 20-instrument task with MFCCs only
Deep Neural Network for Musical Instrument Recognition using MFCCs

Saranga Kingkor Mahanta +2
eess.AS 2020-12-07 reviewed

Multilingual dataset supplies 50,000 hours of speech audio
MLS: A Large-Scale Multilingual Dataset for Speech Research

Vineel Pratap +4
eess.AS 2020-09-21 reviewed

Diffusion model matches WaveNet audio quality but runs far faster
DiffWave: A Versatile Diffusion Model for Audio Synthesis

Zhifeng Kong +4
eess.AS 2020-04-30 reviewed

Jukebox generates coherent multi-minute songs with vocals in raw audio
Jukebox: A Generative Model for Music

Prafulla Dhariwal +5
cs.SD 2019-12-24 reviewed

Neural net predicts giant panda mating success from calls
Audio-based automatic mating success prediction of giant pandas

Weiran Yan +6
eess.AS 2019-07-27 reviewed

Residual filtering removes differential prediction from any voice converter
Generalization of Spectrum Differential based Direct Waveform Modification for Voice Conversion

Wen-Chin Huang +8
cs.CL 2019-07-26 reviewed

Many speech papers misuse the term 'phoneme'
On the Use/Misuse of the Term 'Phoneme'

Roger K. Moore +1
eess.AS 2019-07-26 reviewed

Skip connections plus correlation penalty cut speech errors in noise
Correlation Distance Skip Connection Denoising Autoencoder (CDSK-DAE) for Speech Feature Enhancement

Alzahra Badi +3