Perceptual evaluation of speech quality (PESQ): a new method for speech quality assessment of telephone networks and codecs
2 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary: baseline 1
citation-polarity summary
fields: eess.AS 2
years: 2026 2
verdicts: UNVERDICTED 2
roles: baseline 1
polarities: baseline 1
representative citing papers
A text-to-audio generative model is adapted for room impulse response generation using vision-language model labeling of image-RIR datasets and in-context learning for free-form prompts.
citing papers explorer
- Unmixing the Crowd: Learning Mixture-to-Set Speaker Embeddings for Enrollment-Free Target Speech Extraction
  A neural model predicts a set of speaker embeddings from noisy mixtures to enable enrollment-free target speech extraction, outperforming baselines on LibriMix and generalizing to real recordings.
- Adapting a Text-to-Audio Model for Room Impulse Response Generation
  A text-to-audio generative model is adapted for room impulse response generation using vision-language model labeling of image-RIR datasets and in-context learning for free-form prompts.