archive
Every paper Pith has read.
623 papers in eess.AS · page 12
-
Grounding models yield phonetic features for speech recognition
Transfer Learning from Audio-Visual Grounding to Speech Recognition
-
Attention improves accuracy for all 20 instruments on OpenMIC
An Attention Mechanism for Musical Instrument Recognition
-
Domain teachers train one student model to cut ASR errors by 10.4%
Teach an all-rounder with experts in different domains
-
Joint model cuts speaker diarization error to 2.2%
Joint Speech Recognition and Speaker Diarization via Sequence Transduction
-
Seq2seq ASR cuts WER 25% with speaker adaptation
Listen, Attend, Spell and Adapt: Speaker Adapted Sequence-to-Sequence ASR
-
Cohort pruning enables private score normalisation in speaker recognition
Privacy-Preserving Speaker Recognition with Cohort Score Normalisation
-
Adversarial method cuts speech recognition errors 5-14 percent
NIESR: Nuisance Invariant End-to-end Speech Recognition
-
Shared-layer DNN beats early and late fusion on emotion CCC
Multimodal Fusion with Deep Neural Networks for Audio-Video Emotion Recognition
-
Autoencoder codebook lifts audio emotion prediction scores
Bag-of-Audio-Words based on Autoencoder Codebook for Continuous Emotion Prediction
-
Activation maximization yields class-specific speech from DNNs
Towards Debugging Deep Neural Networks by Generating Speech Utterances
-
17.55 hours of unlabelled Somali audio cut ASR error by 7.74%
Improved low-resource Somali speech recognition by semi-supervised acoustic and language model training
-
CNN learns delays to align speech with emotion labels
Jointly Aligning and Predicting Continuous Emotion Annotations
-
WaveNet upsamples 8 kHz GSM speech near AMR-WB quality
Speech bandwidth extension with WaveNet
-
Deep learning adds controllable emotion to synthetic speech
A Methodology for Controlling the Emotional Expressiveness in Synthetic Speech -- a Deep Learning approach
-
Compensation protocol fixes Unity timing issues for AV research
Synchronizing Audio-Visual Film Stimuli in Unity (version 5.5.1f1): Game Engines as a Tool for Research
-
Transformer spots chords via adaptive attention segments
A Bi-directional Transformer for Musical Chord Recognition
-
ResNet detects replays at 1.08% EER using perturbed group delay grams
The DKU Replay Detection System for the ASVspoof 2019 Challenge: On Data Augmentation, Feature Representation, Classification, and Fusion
-
Phoneme timestamps stabilize prosody transfer from unseen speakers
Fine-grained robust prosody transfer for single-speaker neural text-to-speech
-
Neural net turns any-length audio into full-pose lecture videos
LumièreNet: Lecture Video Synthesis from Audio
-
Frame attention in convRNN sets ESC accuracy records
Attention based Convolutional Recurrent Neural Network for Environmental Sound Classification
-
DKU pipeline reaches 4.96% EER on distant speaker task
The DKU System for the Speaker Recognition Task of the 2019 VOiCES from a Distance Challenge
-
Multi-extractor speaker system reaches 0.392 and 0.494 detection costs
The DKU-SMIIP System for NIST 2018 Speaker Recognition Evaluation
-
CNNs on cochlear features improve speech enhancement for implants
Convolutional Neural Network-based Speech Enhancement for Cochlear Implant Recipients
-
High frame rates reduce ASR word error rates by up to 24%
End-to-End Speech Recognition with High-Frame-Rate Features Extraction
-
CNN layers mirror classical audio features in instrument recognition
A Case Study of Deep-Learned Activations via Hand-Crafted Audio Features
-
Tuned receptive fields let ResNet beat VGG on audio scenes
The Receptive Field as a Regularizer in Deep Convolutional Neural Networks for Acoustic Scene Classification
-
Conditional net reaches 94.69% on Mandarin polyphone task
Polyphone Disambiguation for Mandarin Chinese Using Conditional Neural Network with Multi-level Embedding Features
-
CNNs outperform dense nets on noisy user labels for VoIP audio
Supervised Classifiers for Audio Impairments with Noisy Labels
-
Hierarchical VAE-GAN generates 136-beat melodies with form
MIDI-Sandwich: Multi-model Multi-task Hierarchical Conditional VAE-GAN networks for Symbolic Single-track Music Generation
-
Sub-band CNN cuts spoken term classification compute by up to 49%
Sub-band Convolutional Neural Networks for Small-footprint Spoken Term Classification
-
Decoding trick trains attention models to detect speech features end-to-end
Attention model for articulatory features detection
-
Image context lifts UAV voice command accuracy despite noisy pairings
Kite: Automatic speech recognition for unmanned aerial vehicles
-
Robot reveals full room geometry from random start using sound
Can a Robot Hear the Shape and Dimensions of a Room?
-
Speech separation gains hold up under real ambient noise
WHAM!: Extending Speech Separation to Noisy Environments
-
Cognitive models plus multi-agent rules raise game music immersion
Adaptive Music Composition for Games
-
Two-word recombination enables real-time LSTM LVCSR decoding
LSTM Language Models for LVCSR in First-Pass Decoding and Lattice-Rescoring
-
Distillation plus quantization shrinks AED models to 2% size
Compression of Acoustic Event Detection Models With Quantized Distillation
-
UltraSuite releases ultrasound data from child speech therapy
UltraSuite: A Repository of Ultrasound and Acoustic Data from Child Speech Therapy Sessions
-
Disentangling flows organize synthesizer latent space
Universal audio synthesizer control with normalizing flows
-
TTS data and neural denormalization cut numeric ASR WER by up to 8x
Improving Performance of End-to-End ASR on Numeric Sequences
-
GAN vocoder beats classical methods on perceptual scores
Analysis by Adversarial Synthesis -- A Novel Approach for Speech Vocoding
-
Mean frame lifts speaker-independent ultrasound classification
Speaker-independent classification of phonetic segments from raw ultrasound in child speech
-
Cosine similarity degrades subsidiary models more efficiently than cross-entropy
Cosine similarity-based adversarial process
-
ResNet yields better multilingual bottleneck features for spoken term detection
Multilingual Bottleneck Features for Query by Example Spoken Term Detection
-
Bi-directional network raises joint intent-slot accuracy
A Novel Bi-directional Interrelated Model for Joint Intent Detection and Slot Filling
-
Voice embeddings cut expression detection error by 60%
Leveraging Acoustic Cues and Paralinguistic Embeddings to Detect Expression from Voice
-
Reflection paths extend image sources to curved boundaries
An Image Source Method Framework for Arbitrary Reflecting Boundaries
-
Multi-view lip videos yield better speech from silence
Lipper: Synthesizing Thy Speech using Multi-View Lipreading
-
SVD-PHAT cuts multi-source localization error by up to 0.0395 radians
Multiple Sound Source Localization with SVD-PHAT
-
Artist, album, and track metadata trains music representations
Representation Learning of Music Using Artist, Album, and Track Information