Test-Time Adaptation For Speech Enhancement Via Mask Polarization

Bin Yang; Erfan Amini; Tobias Raichle

arXiv: 2601.14770 · v1 · pith:OVMRE2JQ · submitted 2026-01-21 · eess.AS

Pith reviewed 2026-05-21 16:10 UTC · model grok-4.3

classification 📡 eess.AS

keywords: speech enhancement · test-time adaptation · mask polarization · Wasserstein distance · domain shift · bimodal masks · edge deployment · audio processing

The pith

Restoring bimodal masks via Wasserstein distance comparison adapts speech enhancement models at test time without added parameters.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines how speech enhancement models lose performance when deployed in new acoustic environments. It identifies that predicted masks flatten and lose their sharp separation between speech and noise under such shifts. The proposed solution, mask polarization, counters this by comparing the current mask distribution to a target bimodal one using the Wasserstein distance and adjusting the model outputs accordingly. This adjustment happens entirely at test time using only the existing trained model, with no extra parameters or training data required. Experiments across multiple domain shifts and model architectures show steady improvements that rival more elaborate adaptation techniques.

Core claim

Mask-based speech enhancement models degrade under domain shifts because their output masks lose bimodality and become flattened, reducing decisive speech preservation and noise suppression; restoring bimodality through Wasserstein-distance distribution comparison at test time improves enhancement performance consistently while requiring no additional model parameters.
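
As a toy illustration of why a flattened mask hurts in both directions, the sketch below applies a sharp mask and a flattened mask to a four-bin magnitude mixture. The magnitudes, the mask values, and the helper names are invented for illustration and are not taken from the paper.

```python
import numpy as np

# Toy magnitude spectra (hypothetical values): speech occupies bins 0 and 2,
# noise occupies bins 1 and 3.
speech = np.array([1.0, 0.0, 1.0, 0.0])
noise = np.array([0.0, 1.0, 0.0, 1.0])

def residual_noise_energy(mask):
    """Energy of the noise that survives the mask."""
    return float(np.sum((np.asarray(mask) * noise) ** 2))

def lost_speech_energy(mask):
    """Energy of the speech that the mask removes."""
    return float(np.sum(((1.0 - np.asarray(mask)) * speech) ** 2))

sharp_mask = np.array([1.0, 0.0, 1.0, 0.0])  # decisive, bimodal
flat_mask = np.array([0.6, 0.4, 0.6, 0.4])   # flattened toward 0.5

for name, mask in [("sharp", sharp_mask), ("flat", flat_mask)]:
    print(name, residual_noise_energy(mask), lost_speech_energy(mask))
```

In this toy case the flattened mask both leaks noise and attenuates speech, which is the loss of decisiveness the core claim refers to.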

What carries the argument

Mask polarization (MPol), a test-time adaptation procedure that aligns the distribution of predicted masks to an ideal bimodal form by minimizing Wasserstein distance, thereby recovering confidence in speech versus noise decisions.
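
A minimal sketch of the distribution comparison, under stated assumptions: this page does not give the paper's exact target distribution or loss, so the code assumes a two-point target (mass at 0 and at 1) with a hypothetical `speech_fraction` parameter, and it only evaluates the distance. No gradient step or model update is shown. For two equal-size one-dimensional samples, the empirical 1-Wasserstein distance is the mean absolute difference of the sorted values, which is what `wasserstein_1d` computes.

```python
import numpy as np

def wasserstein_1d(samples_a, samples_b):
    """Empirical 1-Wasserstein distance between two equal-size 1-D samples."""
    a = np.sort(np.asarray(samples_a, dtype=float).ravel())
    b = np.sort(np.asarray(samples_b, dtype=float).ravel())
    if a.size != b.size:
        raise ValueError("samples must have the same number of elements")
    return float(np.mean(np.abs(a - b)))

def bimodal_target(n, speech_fraction=0.5):
    """Hypothetical target: a share of bins at 1 (speech), the rest at 0 (noise)."""
    n_speech = int(round(n * speech_fraction))
    return np.concatenate([np.zeros(n - n_speech), np.ones(n_speech)])

def polarization_distance(mask, speech_fraction=0.5):
    """Distance of the mask's value distribution from the assumed bimodal target."""
    values = np.asarray(mask, dtype=float).ravel()
    return wasserstein_1d(values, bimodal_target(values.size, speech_fraction))

flat_mask = np.full((4, 4), 0.5)                 # maximally indecisive mask
sharp_mask = np.array([[0.0, 1.0], [1.0, 0.0]])  # perfectly polarized mask

print(polarization_distance(flat_mask), polarization_distance(sharp_mask))
```

Under these assumptions the distance is large for the flat mask and zero for the polarized one, so it could serve as an adaptation signal; how the paper turns such a signal into an update is not specified here.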

If this is right

MPol delivers consistent gains across a range of domain shifts and model architectures without retraining.
The method remains competitive with far more complex adaptation strategies while using only the original trained parameters.
Because it adds no parameters, MPol fits directly into resource-limited edge devices for real-time speech enhancement.
The approach operates solely on the model's existing outputs during inference, avoiding any need for source or target domain data.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar polarization steps could extend to other mask-driven audio tasks such as source separation when facing acoustic changes.
If mask flattening proves a general signature of domain shift, the same Wasserstein alignment might transfer to vision models that output soft masks or attention maps.
A direct test would apply MPol to models trained on clean data and measure recovery on real-world recordings with unseen noise profiles or reverberation.

Load-bearing premise

The assumption that flattened masks are the primary cause of performance drop under domain shifts and that forcing bimodality through distribution matching will reliably recover good enhancement quality.

What would settle it

An experiment on a domain shift where masks remain bimodal yet enhancement performance still drops, or where applying the polarization step yields no gain or a loss, would contradict the central mechanism.

Original abstract

Adapting speech enhancement (SE) models to unseen environments is crucial for practical deployments, yet test-time adaptation (TTA) for SE remains largely under-explored due to a lack of understanding of how SE models degrade under domain shifts. We observe that mask-based SE models lose confidence under domain shifts, with predicted masks becoming flattened and losing decisive speech preservation and noise suppression. Based on this insight, we propose mask polarization (MPol), a lightweight TTA method that restores mask bimodality through distribution comparison using the Wasserstein distance. MPol requires no additional parameters beyond the trained model, making it suitable for resource-constrained edge deployments. Experimental results across diverse domain shifts and architectures demonstrate that MPol achieves very consistent gains that are competitive with significantly more complex approaches.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper gives a parameter-free TTA trick for speech enhancement that polarizes flattened masks with Wasserstein distance, but the causal evidence tying that step to the gains still needs checking.

Hi, the main point here is a simple test-time adaptation method for mask-based speech enhancement. They observe that domain shifts make the output masks flatter and less decisive, then fix it by comparing the mask distribution to a bimodal target using Wasserstein distance. No extra parameters, which fits edge use cases well.

What they do well is start from a concrete model behavior instead of a generic regularizer and keep the whole thing lightweight while claiming gains that hold across different shifts and model architectures. If the full experiments show those gains with proper controls, the idea is practical and worth knowing about for anyone deploying SE models.

The softer spot is the missing link between the flattening observation and actual performance drops. The abstract does not report any direct correlation between mask entropy or variance and metrics like PESQ or STOI, and there is no ablation that isolates the polarization step from simpler test-time tricks such as thresholding or gradient stopping. The stress-test note flags exactly this, and until the paper's tables and figures address it the mechanism feels under-supported.

This is aimed at the speech enhancement and audio adaptation crowd rather than a broad audience. A reader already working on robustness in deployed audio systems would get the most out of it. The work shows clear enough thinking and a reproducible setup that it deserves a serious referee instead of a desk reject. I would recommend sending it through peer review with a request for the correlation plots and the ablation that tests whether Wasserstein is load-bearing.

Referee Report

3 major / 1 minor

Summary. The manuscript observes that mask-based speech enhancement models lose confidence under domain shifts, producing flattened masks that lose bimodality and decisive speech/noise decisions. It proposes Mask Polarization (MPol), a lightweight test-time adaptation method that restores bimodality by minimizing the Wasserstein distance between the predicted mask distribution and a target bimodal distribution. MPol requires no additional parameters, and the authors report consistent performance gains across diverse domain shifts and model architectures that are competitive with more complex TTA approaches.

Significance. If the central mechanism is validated with quantitative evidence, the work would address an under-explored area by offering a simple, parameter-free TTA technique suitable for resource-constrained deployments. The approach builds on an observed property of model outputs rather than introducing new trainable components, which could be a practical strength if the performance lift is shown to stem specifically from polarization rather than incidental regularization effects.

major comments (3)

[Abstract] Abstract: the claim of 'very consistent gains' and competitiveness with complex methods is presented without any quantitative results, error bars, dataset descriptions, or ablation details, leaving the link between the Wasserstein polarization step and objective improvements (PESQ/STOI) unverified.
[Introduction / Method] The central premise (observation of mask flattening under shift and its restoration via Wasserstein distance) lacks a shown causal link to performance degradation; no quantitative correlation is reported between mask entropy/variance and PESQ/STOI drops, nor is there an ablation isolating the polarization step from other potential factors such as implicit gradient stopping or batch-norm adaptation.
[Experiments] Experiments section: without an ablation that removes the Wasserstein term while retaining other test-time operations, it remains unclear whether the specific distribution-comparison construction is load-bearing for the reported gains or whether simpler thresholding would suffice.
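
The analyses requested in these comments could be prototyped along the following lines. This is a sketch under assumptions: mean binary entropy is only one possible flatness measure (the page does not say which one the authors would use), the function names are hypothetical, and the per-utterance PESQ or STOI scores would have to come from the reader's own evaluation. `hard_threshold` is the simple baseline the third comment asks to compare against.

```python
import numpy as np

def mean_binary_entropy(mask, eps=1e-12):
    """Average per-bin binary entropy in bits; 1.0 for an all-0.5 mask, near 0 for a binary mask."""
    m = np.clip(np.asarray(mask, dtype=float), eps, 1.0 - eps)
    entropy = -(m * np.log2(m) + (1.0 - m) * np.log2(1.0 - m))
    return float(entropy.mean())

def pearson(x, y):
    """Pearson correlation between two equal-length score lists."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.corrcoef(x, y)[0, 1])

def hard_threshold(mask, threshold=0.5):
    """Baseline polarization: push every bin to 0 or 1 without any distribution matching."""
    return (np.asarray(mask, dtype=float) >= threshold).astype(float)
```

A correlation study would call `mean_binary_entropy` once per utterance and pass the resulting list, together with the matching metric scores, to `pearson`; a strongly negative value would support the claimed link between flattening and degradation.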

minor comments (1)

[Method] The abstract and method description would benefit from explicit notation for the target bimodal distribution and the precise form of the Wasserstein loss used during adaptation.
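
One possible notation of the kind this comment asks for is sketched below. It is not the authors' notation: the two-point target and the mixing weight \pi are assumptions made here for illustration, while the quantile-function form of the one-dimensional 1-Wasserstein distance is standard. P denotes the empirical distribution of predicted mask values in [0, 1] and F_P^{-1} its quantile function.

```latex
% Assumed two-point bimodal target on mask values in [0, 1];
% \pi is a hypothetical speech-fraction weight, not taken from the paper.
Q_\pi = (1-\pi)\,\delta_0 + \pi\,\delta_1, \qquad 0 \le \pi \le 1

% 1-Wasserstein distance in one dimension via quantile functions.
W_1(P, Q_\pi) = \int_0^1 \bigl| F_P^{-1}(u) - F_{Q_\pi}^{-1}(u) \bigr| \, du
             = \int_0^{1-\pi} F_P^{-1}(u)\,du + \int_{1-\pi}^{1} \bigl(1 - F_P^{-1}(u)\bigr)\,du
```

The second equality uses the fact that the target quantile function is 0 on [0, 1-\pi] and 1 afterwards, and that mask values lie in [0, 1].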

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our work regarding mask polarization for test-time adaptation in speech enhancement. We address each major comment below with clarifications from the manuscript and indicate where revisions will strengthen the presentation.

Referee: [Abstract] Abstract: the claim of 'very consistent gains' and competitiveness with complex methods is presented without any quantitative results, error bars, dataset descriptions, or ablation details, leaving the link between the Wasserstein polarization step and objective improvements (PESQ/STOI) unverified.

Authors: We agree the abstract is high-level and would benefit from concrete support. The full manuscript's Experiments section reports results on multiple datasets (including VoiceBank-DEMAND and domain-shifted variants) across architectures, with PESQ and STOI improvements, error bars from multiple runs, and direct comparisons to more complex TTA baselines showing competitive or superior performance. To strengthen the link, we will revise the abstract to include specific quantitative highlights such as average gains and dataset references.
revision: yes

Referee: [Introduction / Method] The central premise (observation of mask flattening under shift and its restoration via Wasserstein distance) lacks a shown causal link to performance degradation; no quantitative correlation is reported between mask entropy/variance and PESQ/STOI drops, nor is there an ablation isolating the polarization step from other potential factors such as implicit gradient stopping or batch-norm adaptation.

Authors: The manuscript includes qualitative evidence in Section 3 with mask histograms and examples demonstrating flattening under shifts and restoration via Wasserstein minimization, alongside performance recovery in experiments. However, we did not report explicit correlation metrics between mask entropy/variance and PESQ/STOI. We will add this analysis in revision. For isolating effects, while comparisons to baselines are present, we will include an ablation disabling the Wasserstein term (retaining test-time forward passes) to rule out incidental factors like batch-norm updates.
revision: partial

Referee: [Experiments] Experiments section: without an ablation that removes the Wasserstein term while retaining other test-time operations, it remains unclear whether the specific distribution-comparison construction is load-bearing for the reported gains or whether simpler thresholding would suffice.

Authors: This is a fair point on isolating the mechanism. Our approach specifically uses Wasserstein distance to a bimodal target rather than generic operations, and experiments compare against non-adaptive and alternative TTA methods. To directly address whether simpler thresholding suffices, we will add an ablation in the revised Experiments section that applies test-time thresholding without the distribution comparison and shows inferior results, confirming the load-bearing role of the polarization construction.
revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical observation drives a direct Wasserstein application

The paper's chain begins with an empirical observation (mask flattening under domain shift) and applies a standard Wasserstein distance to restore bimodality. No equations or claims reduce a performance gain or 'prediction' to a fitted parameter defined inside the same derivation. No self-citation is load-bearing for the central premise, and the method introduces no ansatz or uniqueness theorem that loops back to the authors' prior work. The result is an experimental method rather than a self-contained derivation that is equivalent to its inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on one domain observation about mask behavior and the standard mathematical properties of the Wasserstein distance; no new free parameters or invented entities are introduced.

axioms (1)

domain assumption: Mask-based SE models lose confidence under domain shifts, with predicted masks becoming flattened and losing decisive speech preservation and noise suppression.
This observation, stated in the abstract, directly motivates the polarization step.

pith-pipeline@v0.9.0 · 5657 in / 1216 out tokens · 37462 ms · 2026-05-21T16:10:37.445422+00:00 · methodology

Reference graph

Works this paper leans on

26 extracted references · 26 canonical work pages · 2 internal anchors

[1] Test-Time Adaptation For Speech Enhancement Via Mask Polarization

INTRODUCTION: By leveraging large labeled datasets to learn the complex structure of speech, deep learning based speech enhancement (SE) has revolutionized the field. However, these methods often suffer from performance degradation when deployed in environments that differ from their training conditions [1]. As practical SE systems must handle divers...

[2] The most common approach identifies similar source sample to use as pseudo-labels [5, 6], but requires access to source data, defying the TTA paradigm

RELATED WORK: As SE models frequently encounter unseen target domains where labeled data is unavailable, several previous works have explored applying UDA to SE. The most common approach identifies similar source sample to use as pseudo-labels [5, 6], but requires access to source data, defying the TTA paradigm. RemixIT [3] performs TTA by using a teacher...

[3] We observe that mask-based SE models exhibit a fundamental change in their prediction characteristic under domain shifts

METHODOLOGY: To address this gap, we investigate whether SE models exhibit analogous confidence degradation under domain shifts and propose mask polarization (MPol), a lightweight TTA method that adapts SE models by restoring ideal TF mask characteristics. We observe that mask-based SE models exhibit a fundamental change in their prediction characteris...

[4] Datasets: All models were trained on the source dataset EARS-WHAM! (EARS-W) [9, 10] and evaluated on the 9 target datasets proposed by [1] to cover a wide range of domain shifts

EXPERIMENTS, 4.1 Datasets: All models were trained on the source dataset EARS-WHAM! (EARS-W) [9, 10] and evaluated on the 9 target datasets proposed by [1] to cover a wide range of domain shifts. The EARS-DEMAND (EARS-D) [9, 11] dataset covers domain shifts in only the noisy environment. Analogously, VoiceBank-WHAM! (VBW) [12, 10] represents a shift o...

[5] Results Analysis: Table 1 presents the results averaged across all target datasets for the AM architecture

RESULTS, 5.1 Results Analysis: Table 1 presents the results averaged across all target datasets for the AM architecture. Despite not introducing any additional parameter overhead, MPol achieves competitive performance across both perceptual and signal-level metrics. While MPol is not able to match LaDen’s exceptional PESQ performance, it approximatel...

[6] CONCLUSION: We presented MPol, a lightweight TTA method for speech enhancement that achieves competitive performance across diverse architectures without requiring any additional components. Our key observation that mask-based SE models universally lose bimodal characteristics under domain shifts provides a natural adaptation signal that can be efficie...

[7] Tobias Raichle, Niels Edinger, and Bin Yang, “Test-time adaptation for speech enhancement via domain invariant embedding transformation,” IEEE Open Journal of Signal Processing, pp. 1–10, 2026

[8] Dequan Wang, Evan Shelhamer, Shaoteng Liu, Bruno Olshausen, and Trevor Darrell, “Tent: Fully Test-Time Adaptation by Entropy Minimization,” in International Conference on Learning Representations, 2020

[9] Efthymios Tzinis, Yossi Adi, Vamsi K Ithapu, Buye Xu, Paris Smaragdis, and Anurag Kumar, “RemixIT: Continual self-training of speech enhancement models via bootstrapped remixing,” IEEE Journal of Selected Topics in Signal Processing, vol. 16, no. 6, pp. 1329–1341, 2022

[10] Philipos C. Loizou, Speech Enhancement: Theory and Practice, CRC Press, Inc., USA, 2nd edition, 2013

[11] Hsin-Yi Lin, Huan-Hsin Tseng, Xugang Lu, and Yu Tsao, “Unsupervised noise adaptive speech enhancement by discriminator-constrained optimal transport,” Advances in Neural Information Processing Systems, vol. 34, pp. 19935–19946, 2021

[12] Ching Hua Lee et al., “Leveraging self-supervised speech representations for domain adaptation in speech enhancement,” in ICASSP, 2024, pp. 10831–10835

[13] Sanyuan Chen et al., “WavLM: Large-scale self-supervised pre-training for full stack speech processing,” IEEE Journal of Selected Topics in Signal Processing, vol. 16, no. 6, pp. 1505–1518, 2022

[14] Robert A Marsden, Mario Döbler, and Bin Yang, “Universal test-time adaptation through weight ensembling, diversity weighting, and prior correction,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2024, pp. 2555–2565

[15] Julius Richter et al., “EARS: An Anechoic Fullband Speech Dataset Benchmarked for Speech Enhancement and Dereverberation,” in ISCA Interspeech, 2024, pp. 4873–4877

[16] Gordon Wichern et al., “WHAM!: Extending speech separation to noisy environments,” arXiv preprint arXiv:1907.01160, 2019

[17] Joachim Thiemann, Nobutaka Ito, and Emmanuel Vincent, “The Diverse Environments Multi-Channel Acoustic Noise Database (DEMAND): A database of multi-channel environmental noise recordings,” The Journal of the Acoustical Society of America, vol. 133, pp. 3591, 05 2013

[18] Christophe Veaux, Junichi Yamagishi, and Simon King, “The voice bank corpus: Design, collection and data analysis of a large regional accent speech database,” in 2013 International Conference Oriental COCOSDA held jointly with 2013 Conference on Asian Spoken Language Research and Evaluation (O-COCOSDA/CASLRE), 2013, pp. 1–4

[19] Cassia Valentini-Botinhao, Xin Wang, Shinji Takaki, and Junichi Yamagishi, “Investigating RNN-based speech enhancement methods for noise-robust Text-to-Speech,” in 9th ISCA Workshop on Speech Synthesis Workshop (SSW 9), 2016, pp. 146–152

[20] Harishchandra Dubey et al., “ICASSP 2022 Deep Noise Suppression Challenge,” in ICASSP, 2022

[21] Ruizhe Cao, Sherif Abdulatif, and Bin Yang, “CMGAN: Conformer-based Metric GAN for Speech Enhancement,” in Proc. Interspeech 2022, 2022, pp. 936–940

[22] Szu-Wei Fu, Chien-Feng Liao, Yu Tsao, and Shou-De Lin, “MetricGAN: Generative adversarial networks based black-box metric scores optimization for speech enhancement,” in International Conference on Machine Learning. PMLR, 2019, pp. 2031–2041

[23] Antony W. Rix, John G. Beerends, Mike Hollier, and Andries P. Hekstra, “Perceptual evaluation of speech quality (PESQ)-a new method for speech quality assessment of telephone networks and codecs,” 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.01CH37221), vol. 2, pp. 749–752 vol.2, 2001

[24] Yi Hu and Philipos C. Loizou, “Evaluation of objective quality measures for speech enhancement,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 16, no. 1, pp. 229–238, 2008

[25] Danilo de Oliveira, Simon Welker, Julius Richter, and Timo Gerkmann, “The PESQetarian: On the relevance of Goodhart’s law for speech enhancement,” in Proc. Interspeech 2024, 2024, pp. 3854–3858

[26] Ilya Loshchilov and Frank Hutter, “Decoupled weight decay regularization,” in International Conference on Learning Representations, 2019
