An efficient encoder-decoder architecture with top-down attention for speech separation
2 Pith papers cite this work.
Representative citing papers:
- Discriminative-Generative Target Speaker Extraction with Decoder-Only Language Models
A hybrid two-stage framework pairs a discriminative front-end for interference suppression with a generative decoder-only LM back-end to improve perceptual quality and speaker consistency in target speaker extraction and speech enhancement.
- CodecSep: Prompt-Driven Universal Sound Separation on Neural Audio Codec Latents
CodecSep performs prompt-driven universal sound separation directly in neural audio codec latents by combining a frozen DAC backbone with a lightweight FiLM-conditioned Transformer masker driven by CLAP embeddings, yielding efficiency gains over AudioSep.