arxiv: 2506.13127 · v3 · pith:UFNZGJZPnew · submitted 2025-06-16 · 💻 cs.SD · eess.AS

Leveraging Local and Global Knowledge Integration with Time-Frequency Calibrated Distillation for Speech Enhancement

Pith reviewed 2026-05-22 00:45 UTC · model grok-4.3

classification 💻 cs.SD eess.AS
keywords speech enhancement · knowledge distillation · time-frequency calibration · recursive fusion · intra-set inter-set correlation · low-complexity student model · multi-layer distillation

The pith

A new distillation framework for speech enhancement integrates local and global knowledge through intra-set and inter-set recursive fusion plus dual-stream time-frequency cross-calibration.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes a knowledge distillation method that improves speech enhancement by transferring capabilities from a complex teacher model to a simpler student model. It builds correlated feature sets, performs pairwise multi-layer matching inside each set, and uses recursive fusion to circulate global information across sets. A dual-stream mechanism then calibrates teacher-student similarity weights separately in the time domain and the frequency domain before crossing those weights to allocate distillation effort according to actual speech characteristics. If the approach works, resource-limited devices could run higher-quality enhancement without needing the full teacher network.

Core claim

The I²SRF-TFCKD framework constructs intra-set and inter-set correlations for collaborative distillation, generates fused representative features via recursive fusion, and applies multi-layer interactive distillation with dual-stream time-frequency cross-calibration to exploit speech time-frequency differentials, yielding consistent gains for the low-complexity student model over other distillation schemes on both single-channel and multi-channel datasets.

What carries the argument

Dual-stream time-frequency cross-calibration, which computes teacher-student similarity weights separately in the time and frequency domains and then cross-weights them, so that each layer's distillation contribution is set according to speech signal traits.
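
The mechanism can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: this page does not give the paper's equations, so the function names (`self_similarity`, `cross_calibrated_weights`), the cosine self-similarity maps, the negative mean-squared-difference score, and the reading of "cross-weighting" as an element-wise product of the two softmax weight sets are all hypothetical choices.

```python
import numpy as np

def self_similarity(x, axis):
    """Cosine self-similarity of a (C, T, F) feature along axis 1 (time) or 2 (frequency)."""
    v = np.moveaxis(x, axis, 0).reshape(x.shape[axis], -1)
    v = v / (np.linalg.norm(v, axis=1, keepdims=True) + 1e-8)
    return v @ v.T

def softmax(a, axis):
    e = np.exp(a - a.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_calibrated_weights(teacher_feats, student_feats):
    """Return an (n_teacher, n_student) weight matrix; each column sums to 1.

    Assumption: a teacher-student pair scores higher when their time (or
    frequency) self-similarity maps are closer in mean squared difference.
    """
    n_t, n_s = len(teacher_feats), len(student_feats)
    score_time = np.zeros((n_t, n_s))
    score_freq = np.zeros((n_t, n_s))
    for i, t in enumerate(teacher_feats):
        for j, s in enumerate(student_feats):
            score_time[i, j] = -np.mean((self_similarity(t, 1) - self_similarity(s, 1)) ** 2)
            score_freq[i, j] = -np.mean((self_similarity(t, 2) - self_similarity(s, 2)) ** 2)
    # Separate calibration weights per stream, normalised over teacher layers.
    w_time = softmax(score_time, axis=0)
    w_freq = softmax(score_freq, axis=0)
    # Hypothetical cross-weighting: multiply the two streams and renormalise.
    w = w_time * w_freq
    return w / w.sum(axis=0, keepdims=True)
```

Because the comparison is made on (T, T) and (F, F) self-similarity maps, teacher and student layers in this sketch may have different channel counts, as long as they share the time and frequency dimensions.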

If this is right

  • The low-complexity student model records higher objective scores on single-channel and multi-channel speech enhancement tasks than before distillation.
  • The method surpasses other distillation schemes in direct comparisons on the same datasets.
  • The framework can be applied to existing high-ranking networks such as DPDCRN to produce efficient yet capable enhancement systems.
  • Local information focusing and global knowledge circulation occur simultaneously within one distillation pass.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar time-frequency cross-calibration could be tested in other audio tasks where spectral and temporal structure must be preserved during model compression.
  • The recursive fusion step suggests a general pattern for circulating knowledge across multiple related training subsets in distillation pipelines.
  • If the calibration weights prove stable across datasets, the approach may reduce the need for hand-tuned layer importance in future speech models.

Load-bearing premise

That speech signals contain distinct time-frequency differential information that pairwise multi-layer matching and dual-stream cross-calibration can reliably capture and exploit for measurable performance gains.

What would settle it

If objective metrics on the single-channel and multi-channel test sets show the student model gains no advantage or loses to standard distillation baselines, the claimed benefit of the time-frequency calibrated recursive fusion would not hold.

Figures

Figures reproduced from arXiv: 2506.13127 by Björn W. Schuller, Chao Xu, Jiaming Cheng, Jing Li, Rui Liu, Ruiyu Liang, Wei Zhou, Xiaoshuai Hao, Ye Ni.

Figure 1: Backbone network architecture of the teacher model.
Figure 2: Overall architecture of the I2S-TFCKD framework. We present the detailed process of intra-inter set distillation
Figure 3: Similarity mapping of time and frequency flows. The self-similarity
Figure 4: Training curve trends on the DNS validation set.
Figure 5: Frame-level heatmap distribution of time-frequency alignment weights.
Figure 6: The scatter-bubble distribution of PESQ, FLOPs, and model param
Figure 7: Boxplot of PESQ and the T1 Metric for different distillation strategies on the L3DAS23 validation and development sets.

Original abstract

In this paper, we propose an intra-set and inter-set recursive fusion framework with time-frequency calibrated knowledge distillation (I$^2$SRF-TFCKD) for SE. Different from previous distillation strategies for SE, the proposed framework fully exploits the time-frequency differential information of speech while facilitating both local information focusing and global knowledge circulation. Firstly, we construct a collaborative distillation paradigm for intra-set and inter-set correlations. Within a correlated set, multi-layer teacher-student features are pairwise matched for calibrated distillation. Subsequently, we generate representative features from each correlated set through recursive fusion to form the fused feature set that enables inter-set knowledge interaction. Secondly, we propose a multi-layer interactive distillation based on dual-stream time-frequency cross-calibration, which calculates the teacher-student similarity calibration weights in the time and frequency domains respectively and performs cross-weighting, thus enabling refined allocation of distillation contributions across different layers according to speech characteristics. The proposed distillation strategy is applied to the dual-path dilated convolutional recurrent network (DPDCRN) that ranked first in the SE track of the L3DAS23 challenge. To evaluate the effectiveness of I$^2$SRF-TFCKD, we conduct experiments on both single-channel and multi-channel SE datasets. Objective evaluations demonstrate that the proposed KD strategy consistently and effectively improves the performance of the low-complexity student model and outperforms other distillation schemes.
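
The recursive fusion step described in the abstract can be pictured as a fold over the layer features of one correlated set, yielding one representative feature per set. The sketch below is an assumption-laden illustration only: the paper's fusion operator is not specified on this page, so a fixed convex mix with a hypothetical coefficient `alpha` stands in for it, and the names `recursive_fuse` and `fused_feature_set` are invented here.

```python
import numpy as np

def recursive_fuse(features, alpha=0.5):
    """Fold a list of same-shaped layer features into one representative feature.

    Assumption: each step mixes the running fused feature with the next layer
    using a fixed weight `alpha`; a learned fusion module could replace this.
    """
    fused = features[0]
    for feat in features[1:]:
        fused = alpha * fused + (1.0 - alpha) * feat
    return fused

def fused_feature_set(correlated_sets, alpha=0.5):
    """One representative feature per correlated set, for inter-set distillation."""
    return [recursive_fuse(layer_feats, alpha) for layer_feats in correlated_sets]
```

In this reading, intra-set distillation would operate on the individual layer features, while inter-set distillation would compare the teacher's and the student's fused feature sets.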

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes I²SRF-TFCKD, an intra-set and inter-set recursive fusion framework with time-frequency calibrated knowledge distillation for speech enhancement. It builds a collaborative distillation paradigm using multi-layer teacher-student feature matching within correlated sets, recursive fusion for inter-set interaction, and dual-stream time-frequency cross-calibration for refined layer-wise distillation weights based on speech time-frequency characteristics. The method is applied to the DPDCRN architecture (first place in L3DAS23 SE track) and evaluated on single- and multi-channel datasets, claiming consistent improvements to the low-complexity student model and outperformance versus other distillation schemes.

Significance. If the empirical gains are robustly verified, the work could contribute to more effective knowledge distillation for speech enhancement by explicitly leveraging time-frequency differential information for both local focusing and global circulation. The choice of a high-performing baseline (DPDCRN) and evaluation across single- and multi-channel scenarios are strengths that increase potential impact in practical low-complexity SE deployments.

major comments (2)
  1. [Method (multi-layer interactive distillation) and Experiments] The central claim that pairwise multi-layer matching plus dual-stream time-frequency cross-calibration produces measurable, consistent gains by exploiting time-frequency differential information (abstract and method description) is load-bearing for the outperformance assertion. No ablation is described that removes only the cross-calibration (or the recursive inter-set fusion) while holding the rest of the pipeline fixed; without such controls it remains possible that observed improvements arise from basic multi-layer matching or training schedule differences rather than the claimed TF exploitation.
  2. [Experiments / Results] The abstract states that objective evaluations demonstrate improvements and outperformance, yet the provided description supplies no quantitative metrics (e.g., PESQ, STOI, SI-SDR deltas), error bars, statistical tests, or explicit data-split details. These are required to substantiate the claim that the strategy 'consistently and effectively improves' the student model across datasets.
minor comments (2)
  1. [Abstract / Introduction] The acronym I²SRF-TFCKD is introduced without immediate expansion; spelling it out on first use would improve readability.
  2. [Introduction] Ensure all referenced prior distillation strategies for SE are accompanied by specific citations rather than general statements.
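
Major comment 2 asks for metric deltas such as SI-SDR. For reference, a minimal sketch of scale-invariant SDR under its commonly used definition: the estimate is projected onto the reference, and the energy of that projection is compared with the energy of the residual. The function name `si_sdr` and the stabiliser `eps` are choices made here, and this is not the paper's evaluation code.

```python
import numpy as np

def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant signal-to-distortion ratio in dB for 1-D signals."""
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    # Optimal scaling of the reference that best explains the estimate.
    scale = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = scale * reference
    residual = estimate - target
    return 10.0 * np.log10((np.dot(target, target) + eps) / (np.dot(residual, residual) + eps))
```

A distillation gain would then be reported per utterance as `si_sdr(student_kd_output, clean) - si_sdr(student_baseline_output, clean)`, where the three signal names are placeholders.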

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed comments. We address each major comment below with clarifications drawn from the manuscript and indicate the revisions we will make.

point-by-point responses
  1. Referee: [Method (multi-layer interactive distillation) and Experiments] The central claim that pairwise multi-layer matching plus dual-stream time-frequency cross-calibration produces measurable, consistent gains by exploiting time-frequency differential information (abstract and method description) is load-bearing for the outperformance assertion. No ablation is described that removes only the cross-calibration (or the recursive inter-set fusion) while holding the rest of the pipeline fixed; without such controls it remains possible that observed improvements arise from basic multi-layer matching or training schedule differences rather than the claimed TF exploitation.

    Authors: We agree that a more granular ablation isolating the dual-stream time-frequency cross-calibration (and separately the recursive inter-set fusion) would strengthen the evidence that gains arise specifically from TF differential exploitation rather than generic multi-layer matching. The current manuscript reports comparisons of the full I²SRF-TFCKD against prior distillation methods and includes component-level analysis within the collaborative paradigm, but does not present the exact controlled removal requested. In the revised manuscript we will add these targeted ablations while holding all other elements (including training schedule and base multi-layer matching) fixed. revision: yes

  2. Referee: [Experiments / Results] The abstract states that objective evaluations demonstrate improvements and outperformance, yet the provided description supplies no quantitative metrics (e.g., PESQ, STOI, SI-SDR deltas), error bars, statistical tests, or explicit data-split details. These are required to substantiate the claim that the strategy 'consistently and effectively improves' the student model across datasets.

    Authors: The full manuscript (Section 4 and associated tables) reports concrete quantitative results on both single- and multi-channel datasets, including PESQ, STOI and SI-SDR values with direct comparisons to the teacher, the student without distillation, and competing distillation schemes. Data splits follow the standard partitions of the respective benchmarks (e.g., L3DAS23 and other public SE corpora). To make these findings immediately visible, we will incorporate representative numerical deltas into the abstract and ensure error bars together with any statistical significance statements are explicitly stated or added in the results section. revision: partial
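
The error bars discussed in the second response could be produced in several ways. One common option, shown here as a hedged sketch rather than anything the authors describe, is a paired bootstrap confidence interval on per-utterance metric differences between two systems evaluated on the same test utterances. The function name `paired_bootstrap_ci` and its defaults are assumptions.

```python
import numpy as np

def paired_bootstrap_ci(scores_a, scores_b, n_resamples=2000, level=0.95, seed=0):
    """Mean per-utterance difference (a - b) with a percentile bootstrap interval.

    scores_a and scores_b must be aligned: entry k of each is the same utterance.
    """
    rng = np.random.default_rng(seed)
    diff = np.asarray(scores_a, dtype=float) - np.asarray(scores_b, dtype=float)
    # Resample utterances with replacement and record the mean difference each time.
    idx = rng.integers(0, diff.size, size=(n_resamples, diff.size))
    means = diff[idx].mean(axis=1)
    lo, hi = np.quantile(means, [(1 - level) / 2, (1 + level) / 2])
    return diff.mean(), lo, hi
```

If the interval for, say, the PESQ difference between the distilled and undistilled student excludes zero, the gain is unlikely to be an artifact of which utterances happened to be in the test set.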

Circularity Check

0 steps flagged

No circularity: empirical gains from proposed TF-calibrated KD rest on dataset comparisons, not self-referential definitions or fits

full rationale

The paper describes an I²SRF-TFCKD framework that combines intra-set/inter-set recursive fusion with dual-stream time-frequency cross-calibration for distilling a student model from a teacher on speech enhancement tasks. The central claim of consistent improvement over other KD schemes is presented as the outcome of objective evaluations on single- and multi-channel datasets, with no equations, derivations, or parameter-fitting steps shown that would make the reported gains equivalent to the method's own inputs by construction. The described mechanisms (pairwise multi-layer matching, recursive fusion, and cross-weighting of similarity calibration weights) are architectural choices whose contribution is asserted via external benchmark comparisons rather than internal redefinition or self-citation chains. This is a standard empirical ML paper whose validity hinges on reproducibility of the experiments, not on any load-bearing step that collapses to tautology.

Axiom & Free-Parameter Ledger

1 free parameter · 0 axioms · 0 invented entities

The method introduces multiple design choices whose effectiveness is not derived from first principles but asserted through empirical results; without the full text the exact count of free parameters remains unknown.

free parameters (1)
  • layer-wise matching weights and calibration hyperparameters
    The multi-layer interactive distillation and cross-weighting require choices of similarity metrics and weighting functions that are fitted or tuned during training.


