Asymmetric Encoder-Decoder Based on Time-Frequency Correlation for Speech Separation
Pith reviewed 2026-05-15 06:34 UTC · model grok-4.3
The pith
SR-CorrNet separates speech by assigning coarse separation to the encoder and progressive reconstruction to a weight-shared decoder that interacts across speakers.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that an asymmetric encoder-decoder backbone with a separation-reconstruction (SepRe) strategy and correlation-to-filter estimation recovers target signals more reliably than late-split architectures, because it enables stage-wise refinement and cross-speaker interaction before the final output.
What carries the argument
The SepRe strategy inside a TF dual-path network, where the encoder performs coarse separation and the weight-shared decoder performs progressive reconstruction using cross-speaker interaction, combined with direct estimation of deep filters from spatio-spectro-temporal correlations.
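To make the "weight-shared decoder with cross-speaker interaction" idea concrete, here is a minimal NumPy sketch. It is not the paper's architecture: the single `tanh` layer, the dot-product attention across speaker streams, and the names `shared_decoder_stage` and `cross_speaker_attention` are all illustrative assumptions. It only shows the pattern of applying one set of weights to every speaker stream and then letting the streams exchange information at each position.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_speaker_attention(feats):
    """feats: (S, N, D) with S speaker streams, N TF positions, D channels.
    At every position, each stream attends over all S streams (assumed form)."""
    x = feats.transpose(1, 0, 2)                               # (N, S, D)
    scores = x @ x.transpose(0, 2, 1) / np.sqrt(x.shape[-1])   # (N, S, S)
    mixed = softmax(scores, axis=-1) @ x                       # (N, S, D)
    return feats + mixed.transpose(1, 0, 2)                    # residual connection

def shared_decoder_stage(feats, W):
    """One decoder stage: the same weights W (D, D) act on every stream."""
    h = np.tanh(feats @ W)          # weight sharing: one W for all S streams
    return cross_speaker_attention(h)

rng = np.random.default_rng(0)
S, N, D = 3, 20, 8
coarse = rng.standard_normal((S, N, D))      # stand-in for the encoder's coarse split
W = rng.standard_normal((D, D)) / np.sqrt(D)

refined = coarse
for _ in range(2):                           # progressive, stage-wise refinement
    refined = shared_decoder_stage(refined, W)
```

Because the weights are shared and the interaction is symmetric, reordering the input streams simply reorders the outputs, which is one reason this pattern pairs naturally with permutation-invariant training.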
If this is right
- Consistent SI-SDR and PESQ gains on WSJ0-2Mix through WSJ0-5Mix, WHAMR!, and LibriCSS in both single- and multi-channel settings.
- The attractor-based dynamic split module allows the same model to handle variable speaker counts without retraining.
- Correlation-based filter estimation works across anechoic, noisy-reverberant, and real-recorded conditions.
- Stage-wise refinement in the decoder produces more speaker-discriminative features than single-stage late splitting.
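The variable-speaker-count point above can be sketched as follows. The review does not say how the paper's attractor module works, so everything here is an assumption: the existence-logit thresholding follows the general encoder-decoder-attractor idea, and `count_speakers`, `dynamic_split`, and the sigmoid gating are hypothetical names and choices, not the authors' design.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def count_speakers(exist_logits, threshold=0.5):
    """Count leading attractors whose existence probability is above threshold.
    Counting stops at the first attractor judged absent (assumed convention)."""
    probs = sigmoid(np.asarray(exist_logits, dtype=float))
    below = np.nonzero(probs < threshold)[0]
    return int(below[0]) if below.size else len(probs)

def dynamic_split(features, attractors, exist_logits):
    """features: (N, D) mixture features; attractors: (A, D).
    Returns (S, N, D), one gated stream per detected speaker."""
    s = count_speakers(exist_logits)
    gates = sigmoid(features @ attractors[:s].T)        # (N, S)
    return gates.T[:, :, None] * features[None, :, :]   # (S, N, D)

rng = np.random.default_rng(0)
features = rng.standard_normal((20, 8))
attractors = rng.standard_normal((4, 8))
streams = dynamic_split(features, attractors, exist_logits=[3.0, 2.0, -4.0, 1.0])
```

With these example logits the third attractor falls below the threshold, so two streams are produced even though four attractors were available.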
Where Pith is reading between the lines
- The correlation-to-filter view could be applied to related tasks such as speech enhancement or music source separation where TF structure is also dominant.
- Because the decoder progressively refines features, the architecture may support incremental or streaming inference with partial outputs at intermediate stages.
- The early separation plus cross-speaker interaction pattern might reduce the amount of post-processing needed in downstream diarization or recognition pipelines.
Load-bearing premise
That early coarse separation followed by cross-speaker interaction in the decoder will consistently avoid information loss and improve speaker discriminability more than late disentanglement, without the gains depending on dataset-specific tuning.
What would settle it
A controlled experiment on a held-out noisy-reverberant dataset in which the asymmetric SepRe model shows no improvement or lower SI-SDR than an otherwise identical late-split baseline.
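The comparison described above rests on SI-SDR, so a short reference computation may help. This is the standard scale-invariant SDR definition (project the estimate onto the target, then compare projected energy to residual energy). The two "systems" below are synthetic stand-ins made from noise, not outputs of SR-CorrNet or any baseline.

```python
import numpy as np

def si_sdr(estimate, target, eps=1e-8):
    """Scale-invariant signal-to-distortion ratio in dB for 1-D signals."""
    estimate = estimate - estimate.mean()
    target = target - target.mean()
    # Optimal scaling of the target that best explains the estimate.
    alpha = np.dot(estimate, target) / (np.dot(target, target) + eps)
    projection = alpha * target
    residual = estimate - projection
    return 10 * np.log10((np.sum(projection**2) + eps) / (np.sum(residual**2) + eps))

rng = np.random.default_rng(0)
target = rng.standard_normal(16000)
noise = rng.standard_normal(16000)
system_a_est = target + 0.5 * noise    # synthetic placeholder, noisier estimate
system_b_est = target + 0.1 * noise    # synthetic placeholder, cleaner estimate

gain_db = si_sdr(system_b_est, target) - si_sdr(system_a_est, target)
```

In the settling experiment, the same function would be applied per utterance to the SepRe model and to the late-split baseline, and the per-utterance differences averaged.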
Original abstract
Speech separation in realistic acoustic environments remains challenging because overlapping speakers, background noise, and reverberation must be resolved simultaneously. Although recent time-frequency (TF) domain models have shown strong performance, most still rely on late-split architectures, where speaker disentanglement is deferred to the final stage, creating an information bottleneck and weakening discriminability under adverse conditions. To address this issue, we propose SR-CorrNet, an asymmetric encoder-decoder framework that introduces the separation-reconstruction (SepRe) strategy into a TF dual-path backbone. The encoder performs coarse separation from mixture observations, while the weight-shared decoder progressively reconstructs speaker-discriminative features with cross-speaker interaction, enabling stage-wise refinement. To complement this architecture, we formulate speech separation as a structured correlation-to-filter problem: spatio-spectro-temporal correlations computed from the observations are used as input features, and the corresponding deep filters are estimated to recover target signals. We further incorporate an attractor-based dynamic split module to adapt the number of output streams to the actual speaker configuration. Experimental results on WSJ0-{2,3,4,5}Mix, WHAMR!, and LibriCSS demonstrate consistent improvements across anechoic, noisy-reverberant, and real-recorded conditions in both single- and multi-channel settings, highlighting the effectiveness of TF-domain SepRe with correlation-based filter estimation for speech separation.
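The correlation-to-filter formulation in the abstract can be illustrated with a small NumPy sketch. The paper's equations are not reproduced on this page, so the details are assumptions: per-bin correlations are formed as a trace-normalized outer product of a stacked channel-and-frame vector (in the spirit of the "normalized outer product" wording in the simulated rebuttal further down), frequency neighbours are left out for brevity, and the filter is applied as a multi-frame complex inner product in the style of deep filtering. The names `stack_context`, `correlation_features`, and `apply_deep_filter` are hypothetical.

```python
import numpy as np

def stack_context(X, K):
    """Stack K causal frames per channel: (M, T, F) -> (T, F, K*M)."""
    M, T, F = X.shape
    ctx = np.zeros((K, M, T, F), dtype=complex)
    for k in range(K):
        ctx[k, :, k:, :] = X[:, :T - k, :]   # frame t-k, zero-padded at the start
    return ctx.reshape(K * M, T, F).transpose(1, 2, 0)

def correlation_features(X, K=2, eps=1e-8):
    """Per-bin outer product v v^H, normalized by its trace (assumed normalization)."""
    v = stack_context(X, K)
    corr = v[..., :, None] * np.conj(v[..., None, :])    # (T, F, K*M, K*M)
    power = np.einsum("tfii->tf", corr).real
    return corr / (power[..., None, None] + eps)

def apply_deep_filter(X, W):
    """W: (T, F, K*M) complex filter for one speaker; returns w^H v per bin."""
    K = W.shape[-1] // X.shape[0]
    return np.sum(np.conj(W) * stack_context(X, K), axis=-1)

rng = np.random.default_rng(0)
M, T, F, K = 2, 10, 5, 2
X = rng.standard_normal((M, T, F)) + 1j * rng.standard_normal((M, T, F))

C = correlation_features(X, K)               # network input features
W = np.zeros((T, F, K * M), dtype=complex)   # a network would predict this
W[..., 0] = 1.0                              # trivial filter: pass channel 0, current frame
Y = apply_deep_filter(X, W)
```

In a real system a network would map `C` to one filter `W` per speaker. The trivial filter here only confirms that the filtering step reduces to selecting the reference channel when all other taps are zero.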
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes SR-CorrNet, an asymmetric encoder-decoder framework for TF-domain speech separation that introduces a separation-reconstruction (SepRe) strategy: the encoder performs coarse separation from the mixture while a weight-shared decoder progressively reconstructs speaker-discriminative features via cross-speaker interaction. Separation is reformulated as a structured correlation-to-filter problem in which spatio-spectro-temporal correlations computed from observations serve as input features for estimating deep filters; an attractor-based dynamic split module adapts the number of output streams to the actual speaker count. Experiments on WSJ0-{2,3,4,5}Mix, WHAMR!, and LibriCSS report consistent SI-SDR and PESQ gains across anechoic, noisy-reverberant, and real-recorded conditions in both single- and multi-channel settings.
Significance. If the empirical gains hold under closer scrutiny, the work offers a concrete architectural alternative to late-split TF models by moving speaker disentanglement earlier and grounding filter estimation in explicit correlation features. This could improve robustness in adverse acoustics and provide a template for variable-speaker handling, with potential downstream value for multi-channel and real-world separation pipelines.
major comments (3)
- [§3] The description of the correlation-to-filter formulation (abstract and §3) provides no explicit equations for computing the spatio-spectro-temporal correlation tensors or for mapping them to the estimated deep filters; without these details it is impossible to verify whether the approach is truly parameter-free or how it differs from standard TF masking.
- [§4.3] No ablation studies isolate the contribution of the SepRe strategy or the asymmetric encoder-decoder versus a symmetric late-split baseline; the reported gains on WSJ0-2Mix through 5Mix and WHAMR! therefore cannot be confidently attributed to the proposed architectural choices rather than dataset-specific tuning or overall capacity.
- [§4.2] Results tables (Tables 1–4) present point estimates without error bars, standard deviations, or statistical significance tests across multiple random seeds; this weakens the claim of “consistent improvements” under noisy-reverberant and real-recorded conditions.
minor comments (2)
- [§3.3] The attractor-based dynamic split module is introduced without a clear statement of how the number of attractors is initialized or updated during training.
- [Figure 1] Figure 1 (architecture diagram) would benefit from explicit arrows or labels indicating where the correlation features enter the network and where the SepRe reconstruction loss is applied.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed review. We address each major comment below and will revise the manuscript accordingly to strengthen the presentation and empirical support.
-
Referee: [§3] The description of the correlation-to-filter formulation (abstract and §3) provides no explicit equations for computing the spatio-spectro-temporal correlation tensors or for mapping them to the estimated deep filters; without these details it is impossible to verify whether the approach is truly parameter-free or how it differs from standard TF masking.
Authors: We agree that explicit equations are needed for clarity and reproducibility. The current text describes the correlation-to-filter idea at a high level but omits the precise definitions. In the revised manuscript we will insert the missing equations: the spatio-spectro-temporal correlation tensor is computed as the normalized outer product of the mixture spectrogram features across time-frequency-channel dimensions, and the deep-filter estimator is a small convolutional network that maps this tensor to per-speaker complex filters. These additions will also make explicit that the method is not parameter-free and differs from standard masking by using correlation features as the primary input representation. revision: yes
-
Referee: [§4.3] No ablation studies isolate the contribution of the SepRe strategy or the asymmetric encoder-decoder versus a symmetric late-split baseline; the reported gains on WSJ0-2Mix through 5Mix and WHAMR! therefore cannot be confidently attributed to the proposed architectural choices rather than dataset-specific tuning or overall capacity.
Authors: We acknowledge that dedicated ablations would strengthen attribution of the gains. The original experiments compare against published baselines but do not include an internal symmetric late-split control or a SepRe-ablated variant. We will add these ablation studies in the revision, reporting SI-SDR and PESQ for (i) the full SR-CorrNet, (ii) a symmetric encoder-decoder counterpart, and (iii) a version without the separation-reconstruction loop, all trained under identical conditions on the same data splits. revision: yes
-
Referee: [§4.2] Results tables (Tables 1–4) present point estimates without error bars, standard deviations, or statistical significance tests across multiple random seeds; this weakens the claim of “consistent improvements” under noisy-reverberant and real-recorded conditions.
Authors: We agree that reporting variability is important for robust claims. The original tables contain single-run point estimates. In the revised version we will retrain the models with at least five random seeds, add standard deviations and error bars to all tables, and include paired t-test p-values for the key comparisons on WHAMR! and LibriCSS to support the consistency statements. revision: yes
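The multi-seed reporting promised in this response can be sketched as below: a mean, a sample standard deviation, and a paired t statistic over per-seed scores. The scores are randomly generated placeholders, not results from the paper, and `paired_t_statistic` is an illustrative helper written for this sketch.

```python
import numpy as np

def paired_t_statistic(a, b):
    """Paired t statistic and degrees of freedom for matched samples a and b."""
    d = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    n = d.size
    t = d.mean() / (d.std(ddof=1) / np.sqrt(n))
    return t, n - 1

# Synthetic placeholder SI-SDR scores (dB) for five seeds; NOT measured values.
rng = np.random.default_rng(0)
baseline = 10.0 + 0.2 * rng.standard_normal(5)
proposed = baseline + 0.5 + 0.1 * rng.standard_normal(5)

summary = {
    "baseline": (baseline.mean(), baseline.std(ddof=1)),
    "proposed": (proposed.mean(), proposed.std(ddof=1)),
}
t_stat, dof = paired_t_statistic(proposed, baseline)
```

If SciPy is available, `scipy.stats.ttest_rel(proposed, baseline)` returns the same statistic together with a p-value. With only five seeds the test has little power, so the standard deviations are worth reporting alongside it.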
Circularity Check
No significant circularity detected in derivation chain
The paper defines SR-CorrNet via explicit architectural choices: asymmetric encoder-decoder paths, weight-shared decoder with cross-speaker interaction, SepRe strategy, correlation-to-filter formulation, and attractor-based dynamic split. These are presented as design decisions motivated by information-bottleneck concerns, not derived by construction from fitted quantities or prior self-citations. The central claim rests on empirical results across WSJ0, WHAMR!, and LibriCSS datasets rather than any equation that reduces to its inputs. No self-definitional loops, fitted-input predictions, or load-bearing uniqueness theorems appear in the abstract or described sections. The derivation chain is self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
free parameters (1)
- network weights and hyperparameters
axioms (1)
- domain assumption: Late-split architectures create an information bottleneck that weakens discriminability under adverse conditions.