arxiv: 2602.08556 · v2 · pith:OGBE23A3new · submitted 2026-02-09 · 💻 cs.SD

Global Rotation Equivariant Phase Modeling for Speech Enhancement with Deep Magnitude-Phase Interaction

Pith reviewed 2026-05-21 14:47 UTC · model grok-4.3

classification 💻 cs.SD
keywords speech enhancement · phase modeling · rotation equivariance · magnitude-phase interaction · deep learning · phase retrieval · audio signal processing

The pith

Enforcing global rotation equivariance lets a dual-stream network model phase's circular geometry for improved speech enhancement.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to show that conventional flat Euclidean networks fail to capture the circular topology of phase, and that enforcing global rotation equivariance in a dedicated phase stream solves this. The authors introduce a magnitude-phase dual-stream architecture whose key modules are built to preserve this equivariance during information exchange and feature fusion. If correct, the approach delivers concrete gains such as more than 20 percent lower phase distance in retrieval tasks and over 0.1 higher PESQ in cross-corpus denoising. Readers should care because accurate phase recovery remains a bottleneck in high-quality audio restoration, and most current deep models ignore the periodic structure of phase angles.

Core claim

The central claim is that a magnitude-phase dual-stream framework, using a Magnitude-Phase Interactive Convolutional Module for modulus-based exchange and a Hybrid-Attention Dual Feed-Forward Network for unified fusion, preserves Global Rotation Equivariance in the phase stream and thereby aligns features with the intrinsic circular geometry of phase, yielding superior results over multiple baselines on phase retrieval, denoising, dereverberation, and bandwidth extension.

What carries the argument

The argument rests on Global Rotation Equivariance (GRE). The Magnitude-Phase Interactive Convolutional Module (MPICM) and the Hybrid-Attention Dual Feed-Forward Network (HADF) are built to preserve it, enabling modulus-based interaction while maintaining the circular topology of the phase stream.
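
One way to make the GRE property concrete is the following formalization. It is a sketch under assumptions, not the paper's own definition: the symbols f, X, θ, W, and g are illustrative, and the gated map h is a generic example of a modulus-based interaction rather than the MPICM or HADF themselves.

```latex
% Assumed formalization; f, X, \theta, W, g are illustrative symbols.
% Global rotation equivariance of a phase-stream map f:
f\!\left(e^{j\theta} X\right) = e^{j\theta} f(X), \qquad \forall\, \theta \in [0, 2\pi).

% The modulus is invariant under the same global rotation:
\left| e^{j\theta} X \right| = |X|.

% Hence a bias-free complex-linear map W, gated by a real-valued
% function g of the modulus, is equivariant:
h(X) = g\!\left(|X|\right) \odot (W X)
\;\Longrightarrow\;
h\!\left(e^{j\theta} X\right)
  = g\!\left(\left|e^{j\theta} X\right|\right) \odot \left(W e^{j\theta} X\right)
  = e^{j\theta}\, h(X).
```

The point of the example is that any information passed through the modulus cannot break the rotation symmetry, which is consistent with the summary's description of modulus-based exchange.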

If this is right

  • Phase distance drops by over 20 percent in the phase retrieval task.
  • PESQ rises by more than 0.1 in zero-shot cross-corpus denoising evaluations.
  • Overall superiority holds across universal speech enhancement tasks that mix multiple distortions.
  • Learned phase features exhibit distinct periodic patterns that match the circular nature of phase.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same equivariant modules could be tested on other circular quantities such as inter-channel phase differences or direction-of-arrival angles.
  • Combining the dual-stream design with existing masking or generative vocoders might improve real-time enhancement pipelines.
  • Evaluating the periodic pattern consistency on larger, noisier, or multilingual corpora would test whether the circular alignment generalizes.

Load-bearing premise

Phase possesses an intrinsic circular geometry that standard flat networks cannot model effectively without explicit global rotation equivariance constraints.
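
A small numerical illustration of this premise, assuming NumPy: two phase values close to +π and −π are nearly identical on the circle, yet far apart in a flat Euclidean reading. The function names `wrapped_phase_diff` and `circular_distance` are hypothetical helpers, and this generic wrapped distance is not necessarily the Phase Distance metric reported in the paper.

```python
import numpy as np

def wrapped_phase_diff(a, b):
    """Signed angular difference a - b, wrapped into [-pi, pi)."""
    return (np.asarray(a) - np.asarray(b) + np.pi) % (2 * np.pi) - np.pi

def circular_distance(a, b):
    """Absolute angular distance on the circle, in [0, pi]."""
    return np.abs(wrapped_phase_diff(a, b))

# Two phases just either side of the +/- pi wrap point.
est = np.array([3.1])
ref = np.array([-3.1])

euclid = np.abs(est - ref)          # flat distance: 6.2 rad
circ = circular_distance(est, ref)  # circular distance: 2*pi - 6.2 rad

print("Euclidean:", euclid[0])
print("Circular: ", circ[0])
```

A loss or feature space that treats the two values as 6.2 rad apart penalizes an estimate that is in fact almost exact, which is the failure mode the premise attributes to flat networks.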

What would settle it

A standard convolutional network without the GRE-preserving modules achieves equal or lower phase distance and equal or higher PESQ on the same phase-retrieval and zero-shot denoising benchmarks.

Figures

Figures reproduced from arXiv: 2602.08556 by Andong Li, Chengzhong Wang, Dingding Yao, Junfeng Li.

Figure 1: Overview of the proposed network architecture. (a) The dual-stream encoder-decoder topology. The R-Conv and C-Conv denote real-valued and …
Figure 2: Detailed structure of the MPICM block, including the magnitude and …
Figure 3: Detailed architecture of the Hybrid-Attention Dual-FFN (HADF) module. (Top Left) The macroscopic residual block structure. (Bottom) The Hybrid …
Figure 4: Performance Comparison across varying SNRs. Models were trained on the DNS-2020 corpus and evaluated on re-mixed versions of the …
Figure 5: Spectrogram visualization of enhanced speech under diverse distortion scenarios. The audio files are taken from WSJ0+WHAMR! test set.
Figure 6: Visualization of learned attention patterns for a voiced speech segment.
Abstract

While deep learning has advanced speech enhancement (SE), effective phase modeling remains challenging, as conventional networks typically operate within a flat Euclidean feature space, which is not easy to model the underlying circular topology of the phase. To address this, we propose a magnitude-phase dual-stream framework that aligns the phase stream with its intrinsic circular geometry by enforcing Global Rotation Equivariance (GRE) characteristic. Specifically, we introduce a Magnitude-Phase Interactive Convolutional Module (MPICM) for modulus-based information exchange and a Hybrid-Attention Dual Feed-Forward Network (HADF) bottleneck for unified feature fusion, both of which are designed to preserve GRE in the phase stream. Comprehensive evaluations are conducted across phase retrieval, denoising, dereverberation, and bandwidth extension tasks to validate the superiority of the proposed method over multiple advanced baselines. Notably, the proposed architecture reduces Phase Distance by over 20% in the phase retrieval task and improves PESQ by more than 0.1 in zero-shot cross-corpus denoising evaluations. The overall superiority is also established in universal SE tasks involving mixed distortions. Qualitative analysis further reveals that the learned phase features exhibit distinct periodic patterns, which are consistent with the intrinsic circular nature of the phase. The source code is available at https://github.com/wangchengzhong/GRE-Net.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes a magnitude-phase dual-stream framework for speech enhancement that aligns the phase stream with its intrinsic circular geometry by enforcing Global Rotation Equivariance (GRE). It introduces the Magnitude-Phase Interactive Convolutional Module (MPICM) for modulus-based information exchange and the Hybrid-Attention Dual Feed-Forward Network (HADF) for unified feature fusion, both designed to preserve GRE in the phase stream. The method is evaluated on phase retrieval, denoising, dereverberation, and bandwidth extension tasks, with reported gains including over 20% reduction in Phase Distance for phase retrieval and more than 0.1 PESQ improvement in zero-shot cross-corpus denoising, plus overall superiority in universal SE tasks with mixed distortions. Qualitative analysis shows learned phase features with distinct periodic patterns, and source code is released.

Significance. If the modules indeed enforce GRE and performance gains are attributable to this property (rather than general capacity or fusion improvements), the work could meaningfully advance phase modeling in speech enhancement by providing an architectural solution to the circular topology of phase. The multi-task evaluation scope and public code release are strengths that support reproducibility and broader applicability.

major comments (2)
  1. [Abstract and proposed method] The central claim that MPICM and HADF preserve Global Rotation Equivariance (and that this aligns the phase stream to circular topology unlike flat Euclidean networks) is load-bearing for attributing the reported gains (20% Phase Distance reduction, +0.1 PESQ) to topology alignment. However, no derivation is given showing that a global phase rotation by angle θ on the input produces the corresponding rotation on the output features, nor is there an empirical test (e.g., rotation-equivariance verification) or ablation isolating GRE from dual-stream or attention components.
  2. [Experiments] The quantitative superiority claims (e.g., Phase Distance reduction by over 20% in phase retrieval, PESQ gain >0.1 in zero-shot denoising) are presented without reference to specific tables, statistical testing, baseline configurations, or ablations that separate the contribution of GRE enforcement from increased model capacity or other architectural elements. This leaves the attribution of improvements to the equivariance property unsubstantiated.
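
The rotation-equivariance verification requested in the first major comment could look roughly like the sketch below, assuming NumPy. Everything here is illustrative: `equivariance_error`, `modulus_gated`, and `split_relu` are hypothetical names, and the two toy layers stand in for an equivariant and a non-equivariant design. They are not the paper's MPICM or HADF.

```python
import numpy as np

def equivariance_error(f, x, theta):
    """Relative gap between f(e^{j*theta} x) and e^{j*theta} f(x)."""
    rot = np.exp(1j * theta)
    lhs = f(rot * x)   # rotate the input, then apply the map
    rhs = rot * f(x)   # apply the map, then rotate the output
    return np.linalg.norm(lhs - rhs) / (np.linalg.norm(rhs) + 1e-12)

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 8)) + 1j * rng.standard_normal((8, 8))
x = rng.standard_normal((8, 16)) + 1j * rng.standard_normal((8, 16))

def modulus_gated(z):
    # Bias-free complex-linear map gated by a real function of |z|:
    # the gate is rotation-invariant, the linear map commutes with rotation.
    return np.tanh(np.abs(z)) * (W @ z)

def split_relu(z):
    # ReLU applied separately to real and imaginary parts:
    # this treats the complex plane as flat R^2 and breaks the symmetry.
    y = W @ z
    return np.maximum(y.real, 0) + 1j * np.maximum(y.imag, 0)

theta = 0.7
err_gated = equivariance_error(modulus_gated, x, theta)
err_relu = equivariance_error(split_relu, x, theta)
print("modulus-gated layer:", err_gated)
print("split-ReLU layer:   ", err_relu)
```

Applied to a trained model in place of the toy layers, an error near floating-point precision across several values of θ would support the GRE claim, while a large error would indicate that some component breaks the symmetry.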
minor comments (2)
  1. [Abstract] The abstract states that learned features exhibit 'distinct periodic patterns' consistent with circular phase; consider adding quantitative metrics or visualizations in the results section to strengthen this qualitative observation.
  2. [Method] Notation for phase representation and the exact definition of Global Rotation Equivariance could be clarified early in the method section for readers unfamiliar with the circular topology.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our paper. We address the major comments point-by-point below and plan to incorporate revisions to strengthen the manuscript.

  1. Referee: [Abstract and proposed method] The central claim that MPICM and HADF preserve Global Rotation Equivariance (and that this aligns the phase stream to circular topology unlike flat Euclidean networks) is load-bearing for attributing the reported gains (20% Phase Distance reduction, +0.1 PESQ) to topology alignment. However, no derivation is given showing that a global phase rotation by angle θ on the input produces the corresponding rotation on the output features, nor is there an empirical test (e.g., rotation-equivariance verification) or ablation isolating GRE from dual-stream or attention components.

    Authors: We agree that an explicit derivation and empirical validation would better support our claims. The modules are built from operations that respect rotational symmetry: interactions use the modulus, which is rotation-invariant, and phase adjustments are equivariant. In the revised version we will add a formal derivation, in the method section or an appendix, proving the GRE property for the proposed modules. We will also include an empirical rotation-equivariance test and an ablation study that separates the GRE contribution from other components such as the dual-stream architecture and attention mechanisms. revision: yes

  2. Referee: [Experiments] The quantitative superiority claims (e.g., Phase Distance reduction by over 20% in phase retrieval, PESQ gain >0.1 in zero-shot denoising) are presented without reference to specific tables, statistical testing, baseline configurations, or ablations that separate the contribution of GRE enforcement from increased model capacity or other architectural elements. This leaves the attribution of improvements to the equivariance property unsubstantiated.

    Authors: The performance gains are reported in the experimental results section with comparisons to state-of-the-art baselines across multiple tasks. To improve clarity, we will explicitly reference the relevant tables (e.g., Table 2 for phase retrieval and Table 3 for denoising) in the revised manuscript. We will also add statistical significance tests and additional ablation experiments that control for model capacity and other factors to better attribute the improvements to the GRE enforcement. This will substantiate the claims more rigorously. revision: yes

Circularity Check

0 steps flagged

No circularity detected; claims rest on architectural design and empirical evaluation

The paper proposes a magnitude-phase dual-stream framework with MPICM and HADF modules explicitly designed to preserve Global Rotation Equivariance for aligning the phase stream with circular topology. This is framed as an architectural design choice rather than a mathematical derivation or prediction that reduces to fitted inputs or self-referential definitions. Empirical results on phase retrieval (20% Phase Distance reduction) and denoising (+0.1 PESQ) are presented as validation, supported by qualitative observations of periodic patterns in learned features. No equations, self-citations, or ansatzes are shown that would make the central claims equivalent to the inputs by construction; the derivation chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract supplies no explicit free parameters, mathematical axioms, or newly postulated entities; the framework is described through named modules whose internal mechanics and any implicit assumptions remain unspecified.
