archive
Every paper Pith has read.
623 papers in eess.AS · page 13
-
Gated embeddings cut error in conversational speech recognition
Gated Embeddings in End-to-End Speech Recognition for Conversational-Context Fusion
-
Lattices enable acoustic model adaptation at over 50% error rate
Lattice-Based Unsupervised Test-Time Adaptation of Neural Network Acoustic Models
-
Re-annotation produces 1369 public cough events from AMI corpus
Re-annotation of cough events in the AMI corpus
-
Nonverbal speech features predict group performance on their own
Analyzing Verbal and Nonverbal Features for Predicting Group Performance
-
One embedding space aligns monophonic vocals with full mixes
Learning a Joint Embedding Space of Monophonic and Mixed Music Signals for Singing Voice
-
Soft attention makes audio-to-sheet retrieval tempo-invariant
Learning Soft-Attention Models for Tempo-invariant Audio-Sheet Music Retrieval
-
Russian corpus gives 31 hours of one-speaker speech for TTS
RUSLAN: Russian Spoken Language Corpus for Speech Synthesis
-
Auxiliary loss cuts target-speaker error by 6.6 percent
Auxiliary Interference Speaker Loss for Target-Speaker Speech Recognition
-
Style tokens map to emotions with 5% labels
End-to-End Emotional Speech Synthesis Using Style Tokens and Semi-Supervised Training
-
Essence filtering lets single speech model beat its teacher ensemble
Essence Knowledge Distillation for Speech Recognition
-
Fusion of audio and video features reaches 0.75 CCC for arousal
Emotion Recognition Using Fusion of Audio and Video Features
-
Teacher-student loop aligns 5358 tracks of audio with lyrics
DALI: a large Dataset of synchronized Audio, LyrIcs and notes, automatically created using teacher-student machine learning paradigm
-
Word CTC on 3D-2D-CNN-BLSTM hits 1.3% lipreading WER
LipReading with 3D-2D-CNN BLSTM-HMM and word-CTC models
-
3D CNN ensemble beats baseline in AVA speaker detection
Naver at ActivityNet Challenge 2019 -- Task B Active Speaker Detection (AVA)
-
Adapted solo-singing models cut polyphonic lyrics alignment errors
Acoustic Modeling for Automatic Lyrics-to-Audio Alignment
-
Contrastive loss transfers text knowledge to audio emotion models
Multimodal and Multi-view Models for Emotion Recognition
-
Audio-visual enrollment improves speaker diarisation in meetings
Who said that?: Audio-visual speaker diarisation of real-world meetings
-
Speaker embeddings raise single-channel separation to 4.79 dB SDR
Single-Channel Speech Separation with Auxiliary Speaker Embeddings
-
Balancing and MTL lift end-to-end ASR for Hindi-English code-switching
End-to-End ASR for Code-switched Hindi-English Speech
-
Multi-task network lifts KWS accuracy 32% for hearing aids
Keyword Spotting for Hearing Assistive Devices Robust to External Speakers
-
Scattering coefficients re-synthesize audio textures and enable new effects
The Shape of RemiXXXes to Come: Audio Texture Synthesis with Time-frequency Scattering
-
Phoneme biasing lifts foreign name accuracy 16% over grapheme baselines
Phoneme-Based Contextualization for Cross-Lingual Speech Recognition in End-to-End Models
-
VAE predicts future music values to compose new pieces
Classical Music Prediction and Composition by means of Variational Autoencoders
-
ADSR HMM fusion yields SOTA piano transcription on MAPS
Deep Polyphonic ADSR Piano Note Transcription
-
Querying style-trained VAE with different music yields structured blends
Query-based Deep Improvisation
-
TensorFlow model matches Kaldi accuracy in WFST decoder
Integration of TensorFlow based Acoustic Model with Kaldi WFST Decoder
-
Autoregressive models improve singing voice F0 prediction over RNNs
Singing Voice Synthesis Using Deep Autoregressive Neural Networks for Acoustic Modeling
-
Echoes enable 2D localization from two microphones
Mirage: 2D Source Localization Using Microphone Pair Augmentation with Echoes
-
Melody features classify Hindustani, Carnatic, and Turkish music
Understanding and Classifying Cultural Music Using Melodic Features Case Of Hindustani, Carnatic And Turkish Music
-
Subspace rotation normalizes narrowband stats for wideband DOA
A Signal Subspace Rotation Method for Localization of Multiple Wideband Sound Sources
-
Updating UBM during i-vector training yields 1-2% gains
Unleashing the Unused Potential of I-Vectors Enabled by GPU Acceleration
-
Multi-brain fMRI embeddings outperform raw data in genre and topic tasks
Low-dimensional Embodied Semantics for Music and Language
-
Adversarial training lowers error in music transcription
Adversarial Learning for Improved Onsets and Frames Music Transcription
-
Joint training boosts keyword spotting in noise
A Monaural Speech Enhancement Method for Robust Small-Footprint Keyword Spotting
-
Deep learning enhances MELP codec parameters directly in noise
Parameter Enhancement for MELP Speech Codec in Noisy Communication Environment