archive
Every paper Pith has read.
623 papers in eess.AS · page 10
-
Both global and shared position IDs align video, text, and speech
Mechanisms of Multimodal Synchronization: Insights from Decoder-Based Video-Text-to-Speech Synthesis
-
Image diffusion models transfer music styles without training
Repurposing Image Diffusion Models for Training-Free Music Style Transfer on Mel-spectrograms
-
DGSNA generates dynamic scene-based noise via prompts and diffusion models to augment…
DGSNA: Dynamic Generative Scene-based Noise Addition method
-
Pooling speech datasets improves quality model generalization
MOS-Bench: Benchmarking Generalization Abilities of Subjective Speech Quality Assessment Models
-
Slide text cues extract target speaker from mixed audio
pTSE-T: Presentation Target Speaker Extraction using Unaligned Text Cues
-
GPT-4o responds to audio inputs in 232 milliseconds
GPT-4o System Card
-
Top audio models score only 53% on expert reasoning benchmark
MMAU: A Massive Multi-Task Audio Understanding and Reasoning Benchmark
-
Equivariant transformer beats prototype on chord accompaniment
Music102: An $D_{12}$-equivariant transformer for chord progression accompaniment
-
VoiceBench tests LLM voice assistants on varied real-world speech
VoiceBench: Benchmarking LLM-Based Voice Assistants
-
Text padding plus ConvNeXt yields 0.15 RTF zero-shot TTS
F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching
-
Dataset supplies first mixed cardiopulmonary sounds from manikin
Manikin-Recorded Cardiopulmonary Sounds Dataset Using Digital Stethoscope
-
Two-stage method improves emotion and speaker match in zero-shot TTS
Expressive Prompting: Improving Emotion Intensity and Speaker Consistency in Zero-Shot TTS
-
Moshi delivers real-time full-duplex speech at 160 ms latency
Moshi: a speech-text foundation model for real-time dialogue
-
KAN-enhanced AASIST more than halves deepfake detection error
AASIST3: KAN-Enhanced AASIST Speech Deepfake Detection using SSL Features and Additional Regularization for the ASVspoof 2024 Challenge
-
Tuned MFCC parameters lift respiratory detection accuracy by up to 19.6%
Optimising MFCC parameters for the automatic detection of respiratory diseases
-
Audio model outperforms Gemini on voice instruction tasks
Qwen2-Audio Technical Report
-
Supervised tokens improve zero-shot TTS cloning
CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens
-
Discrete tokens lag continuous features on audio tasks
DASB - Discrete Audio and Speech Benchmark
-
TTS model matches human speech in similarity and naturalness
Seed-TTS: A Family of High-Quality Versatile Speech Generation Models
-
Self-supervised transformer learns rare animal calls from unlabeled audio
animal2vec and MeerKAT: A self-supervised transformer for rare-event raw audio input and a large-scale reference dataset for bioacoustics
-
Lightweight net detects heart murmurs on phones with 80% accuracy
FunnelNet: An End-to-End Deep Learning Framework to Monitor Digital Heart Murmur in Real-Time
-
5-second clips classify pediatric heart sounds at 93.69% accuracy
Classification of Short Segment Pediatric Heart Sounds Based on a Transformer-Based Convolutional Neural Network
-
Humans detect AI media at coin-toss accuracy
As Good As A Coin Toss: Human detection of AI-generated images, videos, audio, and audiovisual stimuli
-
HuBERT detects COVID-19 from voice at 86% accuracy
Developing a Multi-variate Prediction Model For COVID-19 From Crowd-sourced Respiratory Voice Data
-
Community input required for AI reviewing police stops
Community-Informed AI Models for Police Accountability
-
Multi-language dataset of 175 TTS voices boosts deepfake detector training
MLAAD: The Multi-Language Audio Anti-Spoofing Dataset
-
Knowledge transfer reconstructs missing audio to improve sentiment analysis
Multimodal Sentiment Analysis with Missing Modality: A Knowledge-Transfer Approach
-
One audio model covers 30+ tasks without fine-tuning
Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models
-
Model lets LLMs hear speech, sounds and music directly
SALMONN: Towards Generic Hearing Abilities for Large Language Models
-
Deepfake audio augments speech-to-text training data
Deepfake audio as a data augmentation technique for training automatic speech to text transcription models
-
Fused text-speech model beats prior translation systems
AudioPaLM: A Large Language Model That Can Speak and Listen
-
Video-LLaMA adds Q-formers so LLMs grasp video sights and sounds
Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding
-
Nets trained on single words start concatenating them
Basic syntax from speech: Spontaneous concatenation in unsupervised deep neural networks
-
MusicLM turns text into minutes of consistent 24 kHz music
MusicLM: Generating Music From Text
-
Discrete audio code model enables zero-shot TTS from 3s prompt
Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers
-
Scaling to 680k hours enables zero-shot speech recognition
Robust Speech Recognition via Large-Scale Weak Supervision
-
Two-stage filter cleans noisy labels for speaker verification
Robust Training for Speaker Verification against Noisy Labels
-
Neural codec beats baselines at real-time high-fidelity audio compression
High Fidelity Neural Audio Compression
-
Augmented ConvNet classifies COVID coughs at 87% AUC
COVID-19 Diagnosis from Cough Acoustics using ConvNets and Data Augmentation
-
One architecture handles any input and any output structure at linear cost
Perceiver IO: A General Architecture for Structured Inputs & Outputs
-
Five fixed channels unify monaural and binaural auditory model
Towards a generalized monaural and binaural auditory model for psychoacoustics and speech intelligibility
-
Multitask model lowers Anglicism errors in German ASR by 3%
Multitask Learning for Grapheme-to-Phoneme Conversion of Anglicisms in German Speech Recognition
-
Neural net reaches SOTA on 20-instrument task with MFCCs only
Deep Neural Network for Musical Instrument Recognition using MFCCs
-
Multilingual dataset supplies 50,000 hours of speech audio
MLS: A Large-Scale Multilingual Dataset for Speech Research
-
Diffusion model matches WaveNet audio quality but runs far faster
DiffWave: A Versatile Diffusion Model for Audio Synthesis
-
Jukebox generates coherent multi-minute songs with vocals in raw audio
Jukebox: A Generative Model for Music
-
Neural net predicts giant panda mating success from calls
Audio-based automatic mating success prediction of giant pandas
-
Residual filtering removes differential prediction from any voice converter
Generalization of Spectrum Differential based Direct Waveform Modification for Voice Conversion
-
Many speech papers misuse the term 'phoneme'
On the Use/Misuse of the Term 'Phoneme'
-
Skip connections plus correlation penalty cut speech errors in noise
Correlation Distance Skip Connection Denoising Autoencoder (CDSK-DAE) for Speech Feature Enhancement