arxiv: 1907.01160 · v1 · pith:FQ2DK3Y2new · submitted 2019-07-02 · 💻 cs.SD · cs.CL · cs.LG · eess.AS · stat.ML

WHAM!: Extending Speech Separation to Noisy Environments

Pith reviewed 2026-05-25 11:07 UTC · model grok-4.3

classification 💻 cs.SD · cs.CL · cs.LG · eess.AS · stat.ML
keywords speech separation · ambient noise · WHAM dataset · cocktail party problem · audio robustness · speaker mixtures

The pith

Most speech separation models still deliver substantial gains relative to the noisy input, even when real ambient noise is added to the speaker mixtures.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates a dataset called WHAM! by adding real background noise recorded in coffee shops, restaurants, and bars to existing two-speaker speech mixtures. It then evaluates several speech separation methods on this data to see how well they handle the added noise. The evaluation shows that noise lowers the quality of separated speech, yet most methods still produce clear improvements when compared to the original noisy recordings. This setup brings speech separation research closer to practical use in everyday noisy places.

Core claim

By constructing the WSJ0 Hipster Ambient Mixtures dataset from wsj0-2mix and real ambient noise, the work demonstrates that separation performance decreases with noise but most approaches still yield substantial gains relative to the noisy signals.

What carries the argument

The WHAM! dataset that mixes two-speaker audio with real-world ambient noise to benchmark separation robustness.

If this is right

  • Separation performance decreases when real ambient noise is added to speaker mixtures.
  • Most tested architectures and objective functions still provide substantial gains over the noisy input signals.
  • The public availability of the ambient noise samples supports further robustness testing.
  • Realistic scenarios require evaluating separation beyond clean overlapping speech conditions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Models that perform well on this dataset may transfer more readily to consumer devices used in noisy settings.
  • Combining separation with explicit noise reduction steps could amplify the observed gains.
  • Testing the same methods on noise from additional locations or with more speakers would reveal broader applicability.

Load-bearing premise

Ambient noise recorded in a small number of Bay Area locations captures the range of sounds that affect real-world speech separation.

What would settle it

Finding that most of the benchmarked separation models produce no measurable improvement in signal quality metrics on the WHAM! test set, relative to the unprocessed noisy mixtures, would falsify the claim of substantial gains.
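The check described above can be made concrete with the metric the paper reports, SI-SDR (reference [26]). The following is a minimal sketch, not the authors' evaluation code: the function names `si_sdr` and `si_sdr_improvement` are hypothetical, the mean removal is an assumed convention, and the signals are synthetic stand-ins rather than WHAM! data.

```python
import numpy as np


def si_sdr(estimate: np.ndarray, reference: np.ndarray, eps: float = 1e-8) -> float:
    """Scale-invariant SDR in dB between a 1-D estimate and a 1-D reference."""
    # Assumed convention: remove the mean of both signals first.
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    # Project the estimate onto the reference to get the optimally scaled target.
    alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = alpha * reference
    residual = estimate - target
    ratio = (np.dot(target, target) + eps) / (np.dot(residual, residual) + eps)
    return float(10.0 * np.log10(ratio))


def si_sdr_improvement(estimate, reference, mixture) -> float:
    """SI-SDR gain of a separated estimate over the unprocessed mixture."""
    return si_sdr(estimate, reference) - si_sdr(mixture, reference)


# Toy signals (synthetic, not WHAM! data): a sinusoid "speaker" plus white noise.
rng = np.random.default_rng(0)
t = np.arange(8000) / 8000.0
speech = np.sin(2 * np.pi * 220.0 * t)
noise = 0.5 * rng.standard_normal(t.shape)
mixture = speech + noise
estimate = speech + 0.1 * noise  # pretend a model removed most of the noise

print(si_sdr_improvement(estimate, speech, mixture))
```

An improvement near zero across the test set for most models would be the falsifying outcome; a clearly positive value is what the paper reports for most approaches.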

Figures

Figures reproduced from arXiv: 1907.01160 by Dwight Crow, Emmett McQuinn, Ethan Manilow, Gordon Wichern, Joe Antognini, Jonathan Le Roux, Licheng Richard Zhu, Michael Flynn.

Figure 1: Histograms of duration and unique locations where background noise was recorded. [Plot residue: axis labels "estimated SNR (dB)" and "Cumulative histogram over all 10s chunks".]
Figure 2: Estimated speech SNR of all background recordings, with -6 dB threshold indicated.
… (louder) speaker such that the SNR between the first speaker and the noise is equal to the randomly sampled value. The SNR range was chosen by recording conversations in some of the same environments in which the ambient noise was collected, and estimating the relative speech and noise levels. We also examined whether the S…
Figure 3: SI-SDR scatter plots comparing chimera++ performance over different datasets.
A comparison of the different deep clustering modifications discussed in Section 3 for speech separation in noise are provided in …
Original abstract

Recent progress in separating the speech signals from multiple overlapping speakers using a single audio channel has brought us closer to solving the cocktail party problem. However, most studies in this area use a constrained problem setup, comparing performance when speakers overlap almost completely, at artificially low sampling rates, and with no external background noise. In this paper, we strive to move the field towards more realistic and challenging scenarios. To that end, we created the WSJ0 Hipster Ambient Mixtures (WHAM!) dataset, consisting of two speaker mixtures from the wsj0-2mix dataset combined with real ambient noise samples. The samples were collected in coffee shops, restaurants, and bars in the San Francisco Bay Area, and are made publicly available. We benchmark various speech separation architectures and objective functions to evaluate their robustness to noise. While separation performance decreases as a result of noise, we still observe substantial gains relative to the noisy signals for most approaches.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces the WSJ0 Hipster Ambient Mixtures (WHAM!) dataset by augmenting wsj0-2mix two-speaker mixtures with real ambient noise recordings collected from coffee shops, restaurants, and bars in the San Francisco Bay Area. It benchmarks multiple speech separation architectures and objective functions on this dataset, claiming that while separation performance decreases due to noise, most approaches still yield substantial gains relative to the noisy input signals.

Significance. If the noise samples prove representative, the public WHAM! dataset would provide a useful benchmark for evaluating speech separation robustness beyond clean, artificial conditions. The work's strength lies in releasing the collected noise samples and performing a comparative evaluation across models; these elements could help standardize testing for noisy scenarios if the representativeness concern is addressed.

major comments (2)
  1. [WHAM! dataset construction] The claim that WHAM! moves the field toward 'more realistic and challenging scenarios' (abstract) is load-bearing on the assumption that the Bay Area venue recordings are representative. The manuscript provides no quantitative comparison of noise statistics (non-stationarity, SNR distribution, reverberation time, spectral tilt) against other corpora or real deployments, so the observed SI-SDR gains cannot be confidently interpreted as evidence of robustness rather than an artifact of the specific collection conditions.
  2. [Abstract and experimental results] The abstract asserts 'substantial gains relative to the noisy signals for most approaches' but supplies no numerical results, model specifications, or error bars. Without these details in the main text (e.g., tables reporting SI-SDR or equivalent metrics across conditions), the data support for the central empirical claim cannot be verified.
minor comments (2)
  1. [Dataset construction] Clarify the exact mixing procedure, SNR ranges, and any preprocessing applied when combining wsj0-2mix with the new noise samples.
  2. [Benchmarking section] Add references or descriptions for all benchmarked architectures and loss functions to ensure reproducibility.
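To illustrate the kind of detail the first minor comment asks for, here is a minimal sketch of adding noise to a two-speaker mixture at a target SNR measured against the first speaker. It rests on assumptions not confirmed by the text here: SNR is computed from mean signal power (the paper may use a loudness-based level), the noise rather than the speech is rescaled, and `noise_gain_for_snr`, `mix_with_noise`, and the value of `target_snr_db` are hypothetical.

```python
import numpy as np


def noise_gain_for_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> float:
    """Gain for `noise` so that mean speech power / mean noise power equals snr_db."""
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    return float(np.sqrt(speech_power / (noise_power * 10.0 ** (snr_db / 10.0))))


def mix_with_noise(speaker1, speaker2, noise, snr_db):
    """Add noise to a two-speaker mixture at a target SNR relative to speaker 1."""
    gain = noise_gain_for_snr(speaker1, noise, snr_db)
    scaled_noise = gain * noise
    return speaker1 + speaker2 + scaled_noise, scaled_noise


# Synthetic stand-ins for the two utterances and the ambient recording.
rng = np.random.default_rng(0)
speaker1 = rng.standard_normal(16000)
speaker2 = 0.7 * rng.standard_normal(16000)
ambient = 0.3 * rng.standard_normal(16000)

target_snr_db = -3.0  # hypothetical value, not taken from the paper
noisy_mixture, scaled_noise = mix_with_noise(speaker1, speaker2, ambient, target_snr_db)

# Verify the SNR actually achieved between speaker 1 and the scaled noise.
achieved = 10.0 * np.log10(np.mean(speaker1 ** 2) / np.mean(scaled_noise ** 2))
print(round(achieved, 2))
```

Reporting the sampled SNR alongside each mixture, as this sketch returns the scaled noise, would let readers reproduce the conditions exactly.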

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the WHAM! dataset and its evaluation. We address the two major comments point by point below, indicating where revisions will be made.

Point-by-point responses
  1. Referee: [WHAM! dataset construction] The claim that WHAM! moves the field toward 'more realistic and challenging scenarios' (abstract) is load-bearing on the assumption that the Bay Area venue recordings are representative. The manuscript provides no quantitative comparison of noise statistics (non-stationarity, SNR distribution, reverberation time, spectral tilt) against other corpora or real deployments, so the observed SI-SDR gains cannot be confidently interpreted as evidence of robustness rather than an artifact of the specific collection conditions.

    Authors: We agree that the manuscript does not provide quantitative comparisons of the collected noise statistics against other corpora. The noise was recorded in real Bay Area venues to move beyond artificial conditions, but without such comparisons the representativeness claim rests on the collection methodology alone. We will add an analysis of key statistics (e.g., SNR distribution, spectral characteristics, and non-stationarity measures) and compare them to existing noise corpora in a revised version. revision: yes

  2. Referee: [Abstract and experimental results] The abstract asserts 'substantial gains relative to the noisy signals for most approaches' but supplies no numerical results, model specifications, or error bars. Without these details in the main text (e.g., tables reporting SI-SDR or equivalent metrics across conditions), the data support for the central empirical claim cannot be verified.

    Authors: The body of the manuscript contains tables reporting SI-SDR results across models and conditions. However, the abstract does not include specific numerical values or error bars. We will revise the abstract to include key quantitative results (e.g., average SI-SDR improvements) while keeping it concise, and ensure all tables in the main text are clearly referenced with appropriate statistical details. revision: yes
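One way to supply the "appropriate statistical details" mentioned in the second response is a bootstrap confidence interval on the mean per-utterance SI-SDR improvement. The sketch below is an assumption about how that could be done, not something the manuscript describes: `bootstrap_mean_ci` is a hypothetical helper, and the input values are randomly generated placeholders, not results from the paper.

```python
import numpy as np


def bootstrap_mean_ci(values, n_resamples=2000, confidence=0.95, seed=0):
    """Percentile-bootstrap confidence interval for the mean of `values`."""
    values = np.asarray(values, dtype=float)
    rng = np.random.default_rng(seed)
    # Resample utterances with replacement and record the mean of each resample.
    idx = rng.integers(0, len(values), size=(n_resamples, len(values)))
    means = values[idx].mean(axis=1)
    tail = (1.0 - confidence) / 2.0
    lower, upper = np.quantile(means, [tail, 1.0 - tail])
    return float(values.mean()), float(lower), float(upper)


# Placeholder per-utterance improvements in dB (synthetic, not paper results).
rng = np.random.default_rng(1)
improvements_db = rng.normal(loc=10.0, scale=3.0, size=300)

mean_db, ci_low, ci_high = bootstrap_mean_ci(improvements_db)
print(f"mean {mean_db:.2f} dB, 95% CI [{ci_low:.2f}, {ci_high:.2f}]")
```

Resampling at the utterance level treats test mixtures as independent; if mixtures share speakers or noise recordings, a grouped resampling scheme would be the more cautious choice.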

Circularity Check

0 steps flagged

No circularity: empirical dataset construction and benchmarking with no derivations or self-referential reductions.

Full rationale

The paper creates the WHAM! dataset by mixing wsj0-2mix speech with independently recorded ambient noise and reports benchmark results on separation models. No equations, fitted parameters, or first-principles derivations are present. Claims such as 'substantial gains relative to the noisy signals' are direct empirical observations on the constructed test set, not predictions that reduce to prior fits or self-citations by construction. The wsj0-2mix reference is an external public dataset, not a self-citation load-bearing step. This is a standard empirical contribution with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the representativeness of the newly collected noise samples and the assumption that the mixing procedure yields valid test conditions; no free parameters or invented entities are introduced.

axioms (1)
  • domain assumption: Combining wsj0-2mix with the collected ambient noise samples produces representative conditions for evaluating robustness to real-world noise.
    Invoked when describing dataset creation and the move toward realistic scenarios.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. SpurAudio: A Benchmark for Studying Shortcut Learning in Few-Shot Audio Classification

    cs.CV · 2026-05 · unverdicted · novelty 7.0

    SpurAudio benchmark shows state-of-the-art few-shot audio classifiers suffer large performance drops when background correlations are disrupted, even in large pretrained models.

  2. Test-Time Adaptation For Speech Enhancement Via Mask Polarization

    eess.AS · 2026-01 · unverdicted · novelty 6.0

    Mask polarization restores bimodality in SE model predictions via Wasserstein distance at test time, delivering consistent gains across domain shifts and architectures.

  3. Time-Varying Deep State Space Models for Sequences with Switching Dynamics

    cs.LG · 2026-05 · unverdicted · novelty 5.0

    A class of time-varying deep state-space model neural networks is proposed that learns dynamics via a dictionary of basis functions evolving differently over time, outperforming time-invariant versions on switching sy...

Reference graph

Works this paper leans on

29 extracted references · 29 canonical work pages · cited by 3 Pith papers · 2 internal anchors

  1. [1]

    WHAM!: Extending Speech Separation to Noisy Environments

    Introduction The problems of speaker-independent monaural speech enhancement (separating speech from background noise) and speech separation (separating multiple overlapping speech signals) have progressed greatly with modern deep learning-based techniques [1–9]. While high performing enhancement and separation systems share many common techniques, ...

  2. [2]

    WHAM! dataset The wsj0-2mix dataset [3] is composed of two-speaker mixtures from the Wall Street Journal (WSJ0) corpus, and scripts for creating this dataset are publicly available. The mixtures are created by applying randomly selected gains in order to achieve relative levels between 0 and 5 dB between the two speech signals prior to mixing in t...

  3. [3]

    Speech separation objective functions Let $X \in \mathbb{C}^{F \times T}$ be the complex spectrogram of a mixture of $C$ sources $S_c \in \mathbb{C}^{F \times T}$ for $c = 1, \dots, C$. For simplicity, we focus here mainly on methods that attempt to estimate a real-valued mask for each source $\hat{M}_c \in \mathbb{R}^{F \times T}$ by minimizing the truncated phase sensitive approximation (tPSA) objective [2] in a permutation-free manner [...

  4. [4]

    Experimental results The WHAM! dataset allows us to evaluate multiple tasks in a controlled comparable manner. These tasks are: • enhance-single: from a mixture of only the first WSJ0 speaker and noise, recover the signal from the first speaker (typical speech enhancement scenario); • enhance-both: from a mixture of two speakers and noise, recover the mixtu...

  5. [5]

    Initial results show that T-F based separation approaches still perform effectively in the presence of noise

    Conclusion To help move the rapidly advancing speech separation field towards more realistic scenarios, we introduced the WHAM! dataset for evaluation of speaker-independent separation in noisy environments, and used it to benchmark several speech enhancement and speech separation approaches. Initial results show that T-F based separation approaches still ...

  6. [6]

    Discriminatively trained recurrent neural networks for single-channel speech separation,

    F. J. Weninger, J. R. Hershey, J. Le Roux, and B. Schuller, “Discriminatively trained recurrent neural networks for single-channel speech separation,” in GlobalSIP Machine Learning Applications in Speech Processing Symposium, Dec. 2014

  7. [7]

    Phase-sensitive and recognition-boosted speech separation using deep recurrent neural networks,

    H. Erdogan, J. R. Hershey, S. Watanabe, and J. Le Roux, “Phase-sensitive and recognition-boosted speech separation using deep recurrent neural networks,” in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Apr. 2015, pp. 708–712

  8. [8]

    Deep clustering: Discriminative embeddings for segmentation and separation,

    J. R. Hershey, Z. Chen, and J. Le Roux, “Deep clustering: Discriminative embeddings for segmentation and separation,” in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Mar. 2016, pp. 31–35

  9. [9]

    Complex ratio masking for monaural speech separation,

    D. S. Williamson, Y. Wang, and D. Wang, “Complex ratio masking for monaural speech separation,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 3, pp. 483–492, 2016

  10. [10]

    Supervised speech separation based on deep learning: An overview,

    D. Wang and J. Chen, “Supervised speech separation based on deep learning: An overview,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 26, no. 10, pp. 1702–1726, 2018

  11. [11]

    Joint separation and denoising of noisy multi-talker speech using recurrent neural networks and permutation invariant training,

    M. Kolbæk, D. Yu, Z.-H. Tan, and J. Jensen, “Joint separation and denoising of noisy multi-talker speech using recurrent neural networks and permutation invariant training,” in Proc. IEEE International Workshop on Machine Learning for Signal Processing (MLSP), Sep. 2017, pp. 1–6

  12. [12]

    Looking to listen at the cocktail party: A speaker-independent audio-visual model for speech separation,

    A. Ephrat, I. Mosseri, O. Lang, T. Dekel, K. Wilson, A. Hassidim, W. T. Freeman, and M. Rubinstein, “Looking to listen at the cocktail party: A speaker-independent audio-visual model for speech separation,” in Proc. SIGGRAPH, Aug. 2018

  13. [13]

    Phasebook and friends: Leveraging discrete representations for source separation,

    J. Le Roux, G. Wichern, S. Watanabe, A. Sarroff, and J. R. Hershey, “Phasebook and friends: Leveraging discrete representations for source separation,” IEEE Journal of Selected Topics in Signal Processing, 2019

  14. [14]

    Conv-TasNet: Surpassing Ideal Time-Frequency Magnitude Masking for Speech Separation

    Y. Luo and N. Mesgarani, “TasNet: Surpassing ideal time-frequency masking for speech separation,” arXiv preprint arXiv:1809.07454, Sep. 2018

  15. [15]

    Single-channel multi-speaker separation using deep clustering,

    Y. Isik, J. Le Roux, Z. Chen, S. Watanabe, and J. R. Hershey, “Single-channel multi-speaker separation using deep clustering,” in Proc. ISCA Interspeech, Sep. 2016, pp. 545–549

  16. [16]

    Alternative objective functions for deep clustering,

    Z.-Q. Wang, J. Le Roux, and J. R. Hershey, “Alternative objective functions for deep clustering,” in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Apr. 2018

  17. [17]

    Multi-talker speech separation with utterance-level permutation invariant training of deep recurrent neural networks,

    M. Kolbæk, D. Yu, Z.-H. Tan, and J. Jensen, “Multi-talker speech separation with utterance-level permutation invariant training of deep recurrent neural networks,” IEEE/ACM Transactions on Audio, Speech and Language Processing, pp. 1901–1913, 2017

  18. [18]

    End-to-end speech separation with unfolded iterative phase reconstruction,

    Z.-Q. Wang, J. Le Roux, D. Wang, and J. R. Hershey, “End-to-end speech separation with unfolded iterative phase reconstruction,” in Proc. ISCA Interspeech, Sep. 2018

  19. [19]

    FurcaNeXt: End-to-end monaural speech separation with dynamic gated dilated temporal convolutional networks,

    Z. Shi, H. Lin, L. Liu, R. Liu, and J. Han, “FurcaNeXt: End-to-end monaural speech separation with dynamic gated dilated temporal convolutional networks,” arXiv preprint arXiv:1902.04891, 2019

  20. [20]

    Deep learning based phase reconstruction for speaker separation: A trigonometric perspective,

    Z.-Q. Wang, K. Tan, and D. Wang, “Deep learning based phase reconstruction for speaker separation: A trigonometric perspective,” in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), May 2019

  21. [21]

    Phase reconstruction with learned time-frequency representations for single-channel speech separation,

    G. Wichern and J. Le Roux, “Phase reconstruction with learned time-frequency representations for single-channel speech separation,” in Proc. IEEE International Workshop on Acoustic Signal Enhancement (IWAENC), Sep. 2018

  22. [22]

    CSR-I (WSJ0) complete LDC93S6A,

    J. Garofolo, D. Graff, D. Paul, and D. Pallett, “CSR-I (WSJ0) complete LDC93S6A,” 1993, Web Download. Philadelphia: Linguistic Data Consortium

  23. [23]

    Integration of neural networks and probabilistic spatial models for acoustic blind source separation,

    L. Drude and R. Haeb-Umbach, “Integration of neural networks and probabilistic spatial models for acoustic blind source separation,” IEEE Journal of Selected Topics in Signal Processing, 2019

  24. [24]

    Speech production modifications produced by competing talkers, babble, and stationary noise,

    Y. Lu and M. Cooke, “Speech production modifications produced by competing talkers, babble, and stationary noise,” The Journal of the Acoustical Society of America, vol. 124, no. 5, pp. 3261–3275, 2008

  25. [25]

    Toward a recommendation for a European standard of peak and LKFS loudness levels,

    E. Grimm, R. Van Everdingen, and M. Schöpping, “Toward a recommendation for a European standard of peak and LKFS loudness levels,” SMPTE Motion Imaging Journal, vol. 119, no. 3, pp. 28–34, 2010

  26. [26]

    SDR – half-baked or well done?

    J. Le Roux, S. T. Wisdom, H. Erdogan, and J. R. Hershey, “SDR – half-baked or well done?” in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), May 2019

  27. [27]

    Teacher-student deep clustering for low-delay channel speech separation,

    R. Aihara, T. Hanazawa, Y. Okato, G. Wichern, and J. Le Roux, “Teacher-student deep clustering for low-delay channel speech separation,” in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2019

  28. [28]

    TasNet: Time-domain audio separation network for real-time, single-channel speech separation,

    Y. Luo and N. Mesgarani, “TasNet: Time-domain audio separation network for real-time, single-channel speech separation,” in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Apr. 2018

  29. [29]

    Temporal convolutional networks for action segmentation and detection,

    C. Lea, M. D. Flynn, R. Vidal, A. Reiter, and G. D. Hager, “Temporal convolutional networks for action segmentation and detection,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jul. 2017