WHAM!: Extending Speech Separation to Noisy Environments
Pith reviewed 2026-05-25 11:07 UTC · model grok-4.3
The pith
Most speech separation models still deliver substantial gains over the noisy input, even when real ambient noise is added to the speaker mixtures.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By constructing the WSJ0 Hipster Ambient Mixtures dataset from wsj0-2mix and real ambient noise, the work demonstrates that separation performance decreases with noise but most approaches still yield substantial gains relative to the noisy signals.
What carries the argument
The WHAM! dataset that mixes two-speaker audio with real-world ambient noise to benchmark separation robustness.
If this is right
- Separation performance decreases when real ambient noise is added to speaker mixtures.
- Most tested architectures and objective functions still provide substantial gains over the noisy input signals.
- The public availability of the ambient noise samples supports further robustness testing.
- Realistic scenarios require evaluating separation beyond clean overlapping speech conditions.
Where Pith is reading between the lines
- Models that perform well on this dataset may transfer more readily to consumer devices used in noisy settings.
- Combining separation with explicit noise reduction steps could amplify the observed gains.
- Testing the same methods on noise from additional locations or with more speakers would reveal broader applicability.
Load-bearing premise
Ambient noise recorded in a small number of Bay Area locations captures the range of sounds that affect real-world speech separation.
What would settle it
A separation model that produces no measurable improvement in signal quality metrics on the WHAM! test set compared to the unprocessed noisy mixtures would falsify the claim of substantial gains.
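One way to run this check, sketched under assumptions: the snippet below uses SI-SDR (the metric named in the referee report) and measures the gain of a separated estimate over the unprocessed noisy mixture. The function names, the mean removal, and the synthetic signals are illustrative choices, not taken from the paper.

```python
import numpy as np


def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant SDR in dB for two 1-D signals of equal length."""
    # Assumption: remove the mean first, as many implementations do.
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    # Project the estimate onto the reference to get the scaled target.
    alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = alpha * reference
    residual = estimate - target
    ratio = (np.dot(target, target) + eps) / (np.dot(residual, residual) + eps)
    return 10.0 * np.log10(ratio)


def si_sdr_improvement(estimate, reference, noisy_mixture):
    """Gain in dB of the estimate over the unprocessed noisy mixture."""
    return si_sdr(estimate, reference) - si_sdr(noisy_mixture, reference)


# Synthetic stand-ins for speech and noise (not real WHAM! data).
rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)
noise = rng.standard_normal(16000)
mixture = speech + noise
estimate = speech + 0.1 * noise  # hypothetical separator output
gain_db = si_sdr_improvement(estimate, speech, mixture)
print(f"SI-SDR improvement: {gain_db:.1f} dB")
```

A gain near 0 dB across the test set would match the falsifying outcome described above; a clearly positive average would not.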
Original abstract
Recent progress in separating the speech signals from multiple overlapping speakers using a single audio channel has brought us closer to solving the cocktail party problem. However, most studies in this area use a constrained problem setup, comparing performance when speakers overlap almost completely, at artificially low sampling rates, and with no external background noise. In this paper, we strive to move the field towards more realistic and challenging scenarios. To that end, we created the WSJ0 Hipster Ambient Mixtures (WHAM!) dataset, consisting of two speaker mixtures from the wsj0-2mix dataset combined with real ambient noise samples. The samples were collected in coffee shops, restaurants, and bars in the San Francisco Bay Area, and are made publicly available. We benchmark various speech separation architectures and objective functions to evaluate their robustness to noise. While separation performance decreases as a result of noise, we still observe substantial gains relative to the noisy signals for most approaches.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces the WSJ0 Hipster Ambient Mixtures (WHAM!) dataset by augmenting wsj0-2mix two-speaker mixtures with real ambient noise recordings collected from coffee shops, restaurants, and bars in the San Francisco Bay Area. It benchmarks multiple speech separation architectures and objective functions on this dataset, claiming that while separation performance decreases due to noise, most approaches still yield substantial gains relative to the noisy input signals.
Significance. If the noise samples prove representative, the public WHAM! dataset would provide a useful benchmark for evaluating speech separation robustness beyond clean, artificial conditions. The work's strength lies in releasing the collected noise samples and performing a comparative evaluation across models; these elements could help standardize testing for noisy scenarios if the representativeness concern is addressed.
major comments (2)
- [WHAM! dataset construction] The claim that WHAM! moves the field toward 'more realistic and challenging scenarios' (abstract) is load-bearing on the assumption that the Bay Area venue recordings are representative. The manuscript provides no quantitative comparison of noise statistics (non-stationarity, SNR distribution, reverberation time, spectral tilt) against other corpora or real deployments, so the observed SI-SDR gains cannot be confidently interpreted as evidence of robustness rather than an artifact of the specific collection conditions.
- [Abstract and experimental results] The abstract asserts 'substantial gains relative to the noisy signals for most approaches' but supplies no numerical results, model specifications, or error bars. Without these details in the main text (e.g., tables reporting SI-SDR or equivalent metrics across conditions), the data support for the central empirical claim cannot be verified.
minor comments (2)
- [Dataset construction] Clarify the exact mixing procedure, SNR ranges, and any preprocessing applied when combining wsj0-2mix with the new noise samples.
- [Benchmarking section] Add references or descriptions for all benchmarked architectures and loss functions to ensure reproducibility.
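As a rough illustration of the kind of detail the first minor comment asks for, the sketch below scales one speech signal against another to hit a chosen relative level before summing them, following the dataset excerpt quoted under "Reference graph" (random gains giving relative levels between 0 and 5 dB). It assumes mean power as the level measure, which may differ from what the official wsj0-2mix scripts use; `power_db` and `mix_at_relative_level` are hypothetical helper names.

```python
import numpy as np


def power_db(x, eps=1e-12):
    """Mean power of a signal in dB (assumed level measure)."""
    return 10.0 * np.log10(np.mean(x ** 2) + eps)


def mix_at_relative_level(s1, s2, level_db):
    """Scale s2 so that s1 is level_db dB above it, then sum the two."""
    current_db = power_db(s1) - power_db(s2)
    # Amplitude gain that moves the level difference from current_db to level_db.
    gain = 10.0 ** ((current_db - level_db) / 20.0)
    s2_scaled = gain * s2
    return s1 + s2_scaled, s2_scaled


# Synthetic stand-ins for two utterances (not real WSJ0 audio).
rng = np.random.default_rng(0)
s1 = rng.standard_normal(8000)
s2 = 0.3 * rng.standard_normal(8000)
level_db = rng.uniform(0.0, 5.0)  # relative level drawn from [0, 5] dB
mixture, s2_scaled = mix_at_relative_level(s1, s2, level_db)
achieved_db = power_db(s1) - power_db(s2_scaled)
print(f"target {level_db:.2f} dB, achieved {achieved_db:.2f} dB")
```

Reporting the level measure, the gain range, and any clipping or resampling applied after the sum would answer most of the comment.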
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on the WHAM! dataset and its evaluation. We address the two major comments point by point below, indicating where revisions will be made.
-
Referee: [WHAM! dataset construction] The claim that WHAM! moves the field toward 'more realistic and challenging scenarios' (abstract) is load-bearing on the assumption that the Bay Area venue recordings are representative. The manuscript provides no quantitative comparison of noise statistics (non-stationarity, SNR distribution, reverberation time, spectral tilt) against other corpora or real deployments, so the observed SI-SDR gains cannot be confidently interpreted as evidence of robustness rather than an artifact of the specific collection conditions.
Authors: We agree that the manuscript does not provide quantitative comparisons of the collected noise statistics against other corpora. The noise was recorded in real Bay Area venues to move beyond artificial conditions, but without such comparisons the representativeness claim rests on the collection methodology alone. We will add an analysis of key statistics (e.g., SNR distribution, spectral characteristics, and non-stationarity measures) and compare them to existing noise corpora in a revised version.
Revision: yes
-
Referee: [Abstract and experimental results] The abstract asserts 'substantial gains relative to the noisy signals for most approaches' but supplies no numerical results, model specifications, or error bars. Without these details in the main text (e.g., tables reporting SI-SDR or equivalent metrics across conditions), the data support for the central empirical claim cannot be verified.
Authors: The body of the manuscript contains tables reporting SI-SDR results across models and conditions. However, the abstract does not include specific numerical values or error bars. We will revise the abstract to include key quantitative results (e.g., average SI-SDR improvements) while keeping it concise, and ensure all tables in the main text are clearly referenced with appropriate statistical details.
Revision: yes
Circularity Check
No circularity: empirical dataset construction and benchmarking with no derivations or self-referential reductions.
full rationale
The paper creates the WHAM! dataset by mixing wsj0-2mix speech with independently recorded ambient noise and reports benchmark results on separation models. No equations, fitted parameters, or first-principles derivations are present. Claims such as 'substantial gains relative to the noisy signals' are direct empirical observations on the constructed test set, not predictions that reduce to prior fits or self-citations by construction. The wsj0-2mix reference is an external public dataset, not a self-citation load-bearing step. This is a standard empirical contribution with no load-bearing circular steps.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Combining wsj0-2mix with the collected ambient noise samples produces representative conditions for evaluating robustness to real-world noise.
Forward citations
Cited by 3 Pith papers
-
SpurAudio: A Benchmark for Studying Shortcut Learning in Few-Shot Audio Classification
SpurAudio benchmark shows state-of-the-art few-shot audio classifiers suffer large performance drops when background correlations are disrupted, even in large pretrained models.
-
Test-Time Adaptation For Speech Enhancement Via Mask Polarization
Mask polarization restores bimodality in SE model predictions via Wasserstein distance at test time, delivering consistent gains across domain shifts and architectures.
-
Time-Varying Deep State Space Models for Sequences with Switching Dynamics
A class of time-varying deep state-space model neural networks is proposed that learns dynamics via a dictionary of basis functions evolving differently over time, outperforming time-invariant versions on switching sy...
Reference graph
Works this paper leans on
-
[1]
WHAM!: Extending Speech Separation to Noisy Environments
Introduction The problems of speaker-independent monaural speech enhancement (separating speech from background noise) and speech separation (separating multiple overlapping speech signals) have progressed greatly with modern deep learning-based techniques [1–9]. While high performing enhancement and separation systems share many common techniques, ...
-
[2]
WHAM! dataset The wsj0-2mix dataset [3] is composed of two-speaker mixtures from the Wall Street Journal (WSJ0) corpus, and scripts for creating this dataset are publicly available. The mixtures are created by applying randomly selected gains in order to achieve relative levels between 0 and 5 dB between the two speech signals prior to mixing in t...
-
[3]
Speech separation objective functions Let $X \in \mathbb{C}^{F \times T}$ be the complex spectrogram of a mixture of $C$ sources $S_c \in \mathbb{C}^{F \times T}$ for $c = 1, \dots, C$. For simplicity, we focus here mainly on methods that attempt to estimate a real-valued mask for each source $\hat{M}_c \in \mathbb{R}^{F \times T}$ by minimizing the truncated phase sensitive approximation (tPSA) objective [2] in a permutation-free manner [...
-
[4]
Experimental results The WHAM! dataset allows us to evaluate multiple tasks in a controlled comparable manner. These tasks are: • enhance-single: from a mixture of only the first WSJ0 speaker and noise, recover the signal from the first speaker (typical speech enhancement scenario); • enhance-both: from a mixture of two speakers and noise, recover the mixtu...
-
[5]
Conclusion To help move the rapidly advancing speech separation field towards more realistic scenarios, we introduced the WHAM! dataset for evaluation of speaker-independent separation in noisy environments, and used it to benchmark several speech enhancement and speech separation approaches. Initial results show that T-F based separation approaches still ...
-
[6]
Discriminatively trained recurrent neural networks for single-channel speech separation
F. J. Weninger, J. R. Hershey, J. Le Roux, and B. Schuller, “Discriminatively trained recurrent neural networks for single-channel speech separation,” in GlobalSIP Machine Learning Applications in Speech Processing Symposium, Dec. 2014
-
[7]
Phase-sensitive and recognition-boosted speech separation using deep recurrent neural networks
H. Erdogan, J. R. Hershey, S. Watanabe, and J. Le Roux, “Phase-sensitive and recognition-boosted speech separation using deep recurrent neural networks,” in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Apr. 2015, pp. 708–712
-
[8]
Deep clustering: Discriminative embeddings for segmentation and separation
J. R. Hershey, Z. Chen, and J. Le Roux, “Deep clustering: Discriminative embeddings for segmentation and separation,” in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Mar. 2016, pp. 31–35
-
[9]
Complex ratio masking for monaural speech separation
D. S. Williamson, Y. Wang, and D. Wang, “Complex ratio masking for monaural speech separation,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 3, pp. 483–492, 2016
-
[10]
Supervised speech separation based on deep learning: An overview
D. Wang and J. Chen, “Supervised speech separation based on deep learning: An overview,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 26, no. 10, pp. 1702–1726, 2018
-
[11]
M. Kolbæk, D. Yu, Z.-H. Tan, and J. Jensen, “Joint separation and denoising of noisy multi-talker speech using recurrent neural networks and permutation invariant training,” in Proc. IEEE International Workshop on Machine Learning for Signal Processing (MLSP), Sep. 2017, pp. 1–6
-
[12]
A. Ephrat, I. Mosseri, O. Lang, T. Dekel, K. Wilson, A. Hassidim, W. T. Freeman, and M. Rubinstein, “Looking to listen at the cocktail party: A speaker-independent audio-visual model for speech separation,” in Proc. SIGGRAPH, Aug. 2018
-
[13]
Phasebook and friends: Leveraging discrete representations for source separation
J. Le Roux, G. Wichern, S. Watanabe, A. Sarroff, and J. R. Hershey, “Phasebook and friends: Leveraging discrete representations for source separation,” IEEE Journal of Selected Topics in Signal Processing, 2019
-
[14]
Conv-TasNet: Surpassing Ideal Time-Frequency Magnitude Masking for Speech Separation
Y. Luo and N. Mesgarani, “TasNet: Surpassing ideal time-frequency masking for speech separation,” arXiv preprint arXiv:1809.07454, Sep. 2018
-
[15]
Single-channel multi-speaker separation using deep clustering
Y. Isik, J. Le Roux, Z. Chen, S. Watanabe, and J. R. Hershey, “Single-channel multi-speaker separation using deep clustering,” in Proc. ISCA Interspeech, Sep. 2016, pp. 545–549
-
[16]
Alternative objective functions for deep clustering
Z.-Q. Wang, J. Le Roux, and J. R. Hershey, “Alternative objective functions for deep clustering,” in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Apr. 2018
-
[17]
M. Kolbæk, D. Yu, Z.-H. Tan, and J. Jensen, “Multi-talker speech separation with utterance-level permutation invariant training of deep recurrent neural networks,” IEEE/ACM Transactions on Audio, Speech and Language Processing, pp. 1901–1913, 2017
-
[18]
End-to-end speech separation with unfolded iterative phase reconstruction
Z.-Q. Wang, J. Le Roux, D. Wang, and J. R. Hershey, “End-to-end speech separation with unfolded iterative phase reconstruction,” in Proc. ISCA Interspeech, Sep. 2018
-
[19]
Z. Shi, H. Lin, L. Liu, R. Liu, and J. Han, “FurcaNeXt: End-to-end monaural speech separation with dynamic gated dilated temporal convolutional networks,” arXiv preprint arXiv:1902.04891, 2019
-
[20]
Deep learning based phase reconstruction for speaker separation: A trigonometric perspective
Z.-Q. Wang, K. Tan, and D. Wang, “Deep learning based phase reconstruction for speaker separation: A trigonometric perspective,” in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), May 2019
-
[21]
G. Wichern and J. Le Roux, “Phase reconstruction with learned time-frequency representations for single-channel speech separation,” in Proc. IEEE International Workshop on Acoustic Signal Enhancement (IWAENC), Sep. 2018
-
[22]
CSR-I (WSJ0) complete LDC93S6A
J. Garofolo, D. Graff, D. Paul, and D. Pallett, “CSR-I (WSJ0) complete LDC93S6A,” 1993, Web Download. Philadelphia: Linguistic Data Consortium
-
[23]
L. Drude and R. Haeb-Umbach, “Integration of neural networks and probabilistic spatial models for acoustic blind source separation,” IEEE Journal of Selected Topics in Signal Processing, 2019
-
[24]
Speech production modifications produced by competing talkers, babble, and stationary noise
Y. Lu and M. Cooke, “Speech production modifications produced by competing talkers, babble, and stationary noise,” The Journal of the Acoustical Society of America, vol. 124, no. 5, pp. 3261–3275, 2008
-
[25]
Toward a recommendation for a European standard of peak and LKFS loudness levels
E. Grimm, R. Van Everdingen, and M. Schöpping, “Toward a recommendation for a European standard of peak and LKFS loudness levels,” SMPTE Motion Imaging Journal, vol. 119, no. 3, pp. 28–34, 2010
-
[26]
SDR – half-baked or well done?
J. Le Roux, S. T. Wisdom, H. Erdogan, and J. R. Hershey, “SDR – half-baked or well done?” in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), May 2019
-
[27]
Teacher-student deep clustering for low-delay single channel speech separation
R. Aihara, T. Hanazawa, Y. Okato, G. Wichern, and J. Le Roux, “Teacher-student deep clustering for low-delay single channel speech separation,” in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2019
-
[28]
TasNet: Time-domain audio separation network for real-time, single-channel speech separation
Y. Luo and N. Mesgarani, “TasNet: Time-domain audio separation network for real-time, single-channel speech separation,” in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Apr. 2018
-
[29]
Temporal convolutional networks for action segmentation and detection
C. Lea, M. D. Flynn, R. Vidal, A. Reiter, and G. D. Hager, “Temporal convolutional networks for action segmentation and detection,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jul. 2017
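The objective described in excerpt [3] of the reference graph can be sketched as follows. The excerpt is cut off before the formula, so the form used here is an assumption based on how the truncated phase-sensitive approximation is commonly written: each mask times the mixture magnitude is compared, under an L1 norm, with |S_c| cos(∠S_c − ∠X) clipped to [0, |X|], and the loss is the minimum over all assignments of masks to sources. Function names and the toy data are hypothetical.

```python
import itertools

import numpy as np


def tpsa_target(X, S):
    """|S| * cos(angle(S) - angle(X)), truncated to the range [0, |X|]."""
    psa = np.abs(S) * np.cos(np.angle(S) - np.angle(X))
    return np.clip(psa, 0.0, np.abs(X))


def permutation_free_tpsa_loss(X, sources, masks):
    """L1 tPSA loss, minimized over assignments of masks to sources.

    X: complex (F, T) mixture; sources: complex (C, F, T); masks: real (C, F, T).
    Returns the best loss and the permutation that achieves it.
    """
    num_sources = sources.shape[0]
    mix_mag = np.abs(X)
    targets = np.stack([tpsa_target(X, S) for S in sources])
    best_loss, best_perm = np.inf, None
    # Brute force over all C! permutations; fine for small C such as 2 or 3.
    for perm in itertools.permutations(range(num_sources)):
        loss = sum(
            np.abs(masks[c] * mix_mag - targets[p]).sum()
            for c, p in enumerate(perm)
        )
        if loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm


# Toy check: oracle masks handed over in swapped order should still score ~0.
rng = np.random.default_rng(0)
F, T, C = 5, 7, 2
sources = rng.standard_normal((C, F, T)) + 1j * rng.standard_normal((C, F, T))
X = sources.sum(axis=0)
oracle = np.stack([tpsa_target(X, S) for S in sources]) / (np.abs(X) + 1e-12)
loss, perm = permutation_free_tpsa_loss(X, sources, oracle[::-1])
print(loss, perm)
```

The brute-force search over permutations keeps the sketch short; it is a design shortcut for two or three speakers, not a claim about how the paper's systems were trained.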