Leveraging Local and Global Knowledge Integration with Time-Frequency Calibrated Distillation for Speech Enhancement

Björn W. Schuller; Chao Xu; Jiaming Cheng; Jing Li; Rui Liu; Ruiyu Liang; Wei Zhou; Xiaoshuai Hao; Ye Ni

arXiv: 2506.13127 · v3 · pith:UFNZGJZPnew · submitted 2025-06-16 · cs.SD · eess.AS

Jiaming Cheng, Ruiyu Liang, Ye Ni, Chao Xu, Jing Li, Wei Zhou, Rui Liu, Björn W. Schuller, Xiaoshuai Hao

Pith reviewed 2026-05-22 00:45 UTC · model grok-4.3

keywords: speech enhancement, knowledge distillation, time-frequency calibration, recursive fusion, intra-set inter-set correlation, low-complexity student model, multi-layer distillation

The pith

A new distillation framework for speech enhancement integrates local and global knowledge through intra-set and inter-set recursive fusion plus dual-stream time-frequency cross-calibration.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes a knowledge distillation method that improves speech enhancement by transferring capabilities from a complex teacher model to a simpler student model. It builds correlated feature sets, performs pairwise multi-layer matching inside each set, and uses recursive fusion to circulate global information across sets. A dual-stream mechanism then calibrates teacher-student similarity weights separately in the time domain and the frequency domain before crossing those weights to allocate distillation effort according to actual speech characteristics. If the approach works, resource-limited devices could run higher-quality enhancement without needing the full teacher network.

Core claim

The I²SRF-TFCKD framework constructs intra-set and inter-set correlations for collaborative distillation, generates fused representative features via recursive fusion, and applies multi-layer interactive distillation with dual-stream time-frequency cross-calibration to exploit speech time-frequency differentials, yielding consistent gains for the low-complexity student model over other distillation schemes on both single-channel and multi-channel datasets.

What carries the argument

Dual-stream time-frequency cross-calibration, which computes similarity weights separately in the time and frequency domains and then cross-weights them to refine per-layer distillation contributions according to speech signal traits.
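
As a rough illustration of the cross-calibration idea, the sketch below shows one possible reading of it in NumPy. It is not the authors' implementation: the source gives no equations, so the pooling, the cosine similarity, the softmax normalization, and the specific cross-weighting rule (time-stream distances weighted by frequency-stream weights, and vice versa) are all assumptions, as are the names `stream_descriptors` and `cross_calibrated_loss`. Student features are assumed to be already projected to the teacher's shape `(C, T, F)`.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along one axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def stream_descriptors(feats, pool_axis):
    """Pool each (C, T, F) feature over one axis, flatten, and L2-normalize.

    pool_axis=2 averages out frequency (time stream);
    pool_axis=1 averages out time (frequency stream).
    """
    out = []
    for f in feats:
        d = f.mean(axis=pool_axis).ravel()
        out.append(d / (np.linalg.norm(d) + 1e-8))
    return np.stack(out)


def cross_calibrated_loss(student_feats, teacher_feats):
    """Hypothetical cross-weighted distillation loss over all layer pairs."""
    s_time = stream_descriptors(student_feats, pool_axis=2)  # (Ls, C*T)
    t_time = stream_descriptors(teacher_feats, pool_axis=2)  # (Lt, C*T)
    s_freq = stream_descriptors(student_feats, pool_axis=1)  # (Ls, C*F)
    t_freq = stream_descriptors(teacher_feats, pool_axis=1)  # (Lt, C*F)

    # Cosine similarity between every student layer and every teacher layer.
    sim_time = s_time @ t_time.T  # (Ls, Lt)
    sim_freq = s_freq @ t_freq.T  # (Ls, Lt)

    # Calibration weights: each student layer distributes weight over teacher layers.
    w_time = softmax(sim_time, axis=1)
    w_freq = softmax(sim_freq, axis=1)

    # For unit-norm descriptors, squared L2 distance equals 2 - 2 * cosine.
    d_time = 2.0 - 2.0 * sim_time
    d_freq = 2.0 - 2.0 * sim_freq

    # Assumed cross-weighting: each stream's distance is weighted by the other stream.
    loss = (w_freq * d_time).sum() + (w_time * d_freq).sum()
    return loss, w_time, w_freq
```

Under these assumptions, a student layer that resembles a given teacher layer along the frequency axis has its time-domain mismatch with that layer penalized more heavily, and the reverse. A learned projection or attention module could replace the fixed pooling without changing the overall structure.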

If this is right

The low-complexity student model records higher objective scores on single-channel and multi-channel speech enhancement tasks than before distillation.
The method surpasses other distillation schemes in direct comparisons on the same datasets.
The framework can be applied to existing high-ranking networks such as DPDCRN to produce efficient yet capable enhancement systems.
Local information focusing and global knowledge circulation occur simultaneously within one distillation pass.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar time-frequency cross-calibration could be tested in other audio tasks where spectral and temporal structure must be preserved during model compression.
The recursive fusion step suggests a general pattern for circulating knowledge across multiple related training subsets in distillation pipelines.
If the calibration weights prove stable across datasets, the approach may reduce the need for hand-tuned layer importance in future speech models.

Load-bearing premise

That speech signals contain distinct time-frequency differential information that pairwise multi-layer matching and dual-stream cross-calibration can reliably capture and exploit for measurable performance gains.

What would settle it

If objective metrics on the single-channel and multi-channel test sets show the student model gains no advantage or loses to standard distillation baselines, the claimed benefit of the time-frequency calibrated recursive fusion would not hold.

Figures

Figures reproduced from arXiv: 2506.13127 by Björn W. Schuller, Chao Xu, Jiaming Cheng, Jing Li, Rui Liu, Ruiyu Liang, Wei Zhou, Xiaoshuai Hao, Ye Ni.

**Figure 2.** Overall architecture of the I2S-TFCKD framework. We present the detailed process of intra-inter set distillation

**Figure 3.** Similarity mapping of time and frequency flows. The self-similarity

**Figure 4.** Training curve trends on the DNS validation set.

**Figure 5.** Frame-level heatmap distribution of time-frequency alignment weights.

**Figure 6.** The scatter-bubble distribution of PESQ, FLOPs, and model param

**Figure 7.** Boxplot of PESQ and the T1 Metric for different distillation strategies on the L3DAS23 validation and development sets.

Original abstract

In this paper, we propose an intra-set and inter-set recursive fusion framework with time-frequency calibrated knowledge distillation (I$^2$SRF-TFCKD) for SE. Different from previous distillation strategies for SE, the proposed framework fully exploits the time-frequency differential information of speech while facilitating both local information focusing and global knowledge circulation. Firstly, we construct a collaborative distillation paradigm for intra-set and inter-set correlations. Within a correlated set, multi-layer teacher-student features are pairwise matched for calibrated distillation. Subsequently, we generate representative features from each correlated set through recursive fusion to form the fused feature set that enables inter-set knowledge interaction. Secondly, we propose a multi-layer interactive distillation based on dual-stream time-frequency cross-calibration, which calculates the teacher-student similarity calibration weights in the time and frequency domains respectively and performs cross-weighting, thus enabling refined allocation of distillation contributions across different layers according to speech characteristics. The proposed distillation strategy is applied to the dual-path dilated convolutional recurrent network (DPDCRN) that ranked first in the SE track of the L3DAS23 challenge. To evaluate the effectiveness of I$^2$SRF-TFCKD, we conduct experiments on both single-channel and multi-channel SE datasets. Objective evaluations demonstrate that the proposed KD strategy consistently and effectively improves the performance of the low-complexity student model and outperforms other distillation schemes.
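
The recursive fusion step in the abstract can be pictured as a fold over the features of one correlated set. The sketch below is only a stand-in. The source does not specify the fusion operator, so the fixed convex combination, the `alpha` parameter, and the function names `recursive_fuse` and `build_fused_set` are hypothetical, and all features in a set are assumed to share one shape.

```python
import numpy as np


def recursive_fuse(features, alpha=0.5):
    """Fold a list of same-shaped features into one representative feature.

    Each step mixes the running fused feature with the next layer's feature.
    The fixed mixing weight `alpha` is an assumption; a trained model would
    more plausibly use a learned fusion module here.
    """
    fused = features[0]
    for feat in features[1:]:
        fused = alpha * fused + (1.0 - alpha) * feat
    return fused


def build_fused_set(correlated_sets, alpha=0.5):
    """Produce one representative feature per correlated set.

    The resulting list plays the role of the fused feature set on which
    inter-set distillation would then operate.
    """
    return [recursive_fuse(feature_set, alpha) for feature_set in correlated_sets]
```

With this operator, later layers in a set carry more weight in the representative feature than earlier ones, which is one simple way for information to accumulate through the recursion.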

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This KD paper for speech enhancement adds recursive fusion and TF cross-calibration but the central gains rest on unablated comparisons.

The paper's main contribution is a knowledge distillation method for speech enhancement called I²SRF-TFCKD. It uses intra-set and inter-set recursive fusion along with dual-stream time-frequency cross-calibration to transfer knowledge from a teacher to a lighter student model based on DPDCRN. The authors start by creating correlated sets in which teacher and student features are matched layer by layer. Recursive fusion then generates representative features for inter-set interaction. The key part is the multi-layer interactive distillation, which computes similarity weights separately in the time and frequency domains and cross-weights them. This is meant to allocate distillation effort according to speech characteristics.

The approach is new in how it combines recursive fusion with TF calibration specific to SE distillation. Its strengths are that it builds directly on a challenge-winning model and evaluates across single-channel and multi-channel setups. The claim is that the student improves consistently and beats other KD methods. The soft spots are that the abstract gives no numbers or details on the improvements, and that no ablations are shown that isolate the contribution of the cross-calibration or the recursive fusion. The stress-test concern is valid here: without those controls, the gains could come from the multi-layer matching alone or from other training choices. The full paper might address this, but based on the description, the evidence for the specific mechanism is not yet strong.

This work is for researchers focused on model compression in audio processing, especially those looking for ways to make SE models efficient enough for real devices. A reader working on KD techniques would find the framework description useful. I recommend sending it to peer review. The idea is grounded in the problem and the experiments, even if they need more rigor to confirm the novelty of the components.

Referee Report

2 major / 2 minor

Summary. The paper proposes I²SRF-TFCKD, an intra-set and inter-set recursive fusion framework with time-frequency calibrated knowledge distillation for speech enhancement. It builds a collaborative distillation paradigm using multi-layer teacher-student feature matching within correlated sets, recursive fusion for inter-set interaction, and dual-stream time-frequency cross-calibration for refined layer-wise distillation weights based on speech time-frequency characteristics. The method is applied to the DPDCRN architecture (first place in L3DAS23 SE track) and evaluated on single- and multi-channel datasets, claiming consistent improvements to the low-complexity student model and outperformance versus other distillation schemes.

Significance. If the empirical gains are robustly verified, the work could contribute to more effective knowledge distillation for speech enhancement by explicitly leveraging time-frequency differential information for both local focusing and global circulation. The choice of a high-performing baseline (DPDCRN) and evaluation across single- and multi-channel scenarios are strengths that increase potential impact in practical low-complexity SE deployments.

major comments (2)

[Method (multi-layer interactive distillation) and Experiments] The central claim that pairwise multi-layer matching plus dual-stream time-frequency cross-calibration produces measurable, consistent gains by exploiting time-frequency differential information (abstract and method description) is load-bearing for the outperformance assertion. No ablation is described that removes only the cross-calibration (or the recursive inter-set fusion) while holding the rest of the pipeline fixed; without such controls it remains possible that observed improvements arise from basic multi-layer matching or training schedule differences rather than the claimed TF exploitation.
[Experiments / Results] The abstract states that objective evaluations demonstrate improvements and outperformance, yet the provided description supplies no quantitative metrics (e.g., PESQ, STOI, SI-SDR deltas), error bars, statistical tests, or explicit data-split details. These are required to substantiate the claim that the strategy 'consistently and effectively improves' the student model across datasets.
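
The kind of reporting requested in the second major comment can be sketched as follows. This is an illustration only, not taken from the paper: it uses the standard scale-invariant SDR definition and a percentile bootstrap over per-utterance score differences (distilled student minus baseline student). The function names and the bootstrap settings are assumptions.

```python
import numpy as np


def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant signal-to-distortion ratio in dB for 1-D signals."""
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    # Project the estimate onto the reference to find the scaled target.
    scale = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = scale * reference
    noise = estimate - target
    return 10.0 * np.log10((np.dot(target, target) + eps) / (np.dot(noise, noise) + eps))


def paired_bootstrap_ci(deltas, n_boot=2000, level=0.95, seed=0):
    """Mean per-utterance delta with a percentile bootstrap confidence interval."""
    rng = np.random.default_rng(seed)
    deltas = np.asarray(deltas, dtype=float)
    # Resample utterances with replacement and recompute the mean each time.
    idx = rng.integers(0, len(deltas), size=(n_boot, len(deltas)))
    means = deltas[idx].mean(axis=1)
    lo, hi = np.quantile(means, [(1.0 - level) / 2.0, (1.0 + level) / 2.0])
    return deltas.mean(), lo, hi
```

If the interval for the mean delta excludes zero on each test set, that would be one reasonable way to back the word "consistently" with an error bar. The same resampling applies to PESQ or STOI deltas computed with whatever implementation the authors use.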

minor comments (2)

[Abstract / Introduction] The acronym I²SRF-TFCKD is introduced without immediate expansion; spelling it out on first use would improve readability.
[Introduction] Ensure all referenced prior distillation strategies for SE are accompanied by specific citations rather than general statements.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed comments. We address each major comment below with clarifications drawn from the manuscript and indicate the revisions we will make.

Referee: [Method (multi-layer interactive distillation) and Experiments] The central claim that pairwise multi-layer matching plus dual-stream time-frequency cross-calibration produces measurable, consistent gains by exploiting time-frequency differential information (abstract and method description) is load-bearing for the outperformance assertion. No ablation is described that removes only the cross-calibration (or the recursive inter-set fusion) while holding the rest of the pipeline fixed; without such controls it remains possible that observed improvements arise from basic multi-layer matching or training schedule differences rather than the claimed TF exploitation.

Authors: We agree that a more granular ablation isolating the dual-stream time-frequency cross-calibration (and separately the recursive inter-set fusion) would strengthen the evidence that gains arise specifically from TF differential exploitation rather than generic multi-layer matching. The current manuscript reports comparisons of the full I²SRF-TFCKD against prior distillation methods and includes component-level analysis within the collaborative paradigm, but does not present the exact controlled removal requested. In the revised manuscript we will add these targeted ablations while holding all other elements (including training schedule and base multi-layer matching) fixed.

revision: yes

Referee: [Experiments / Results] The abstract states that objective evaluations demonstrate improvements and outperformance, yet the provided description supplies no quantitative metrics (e.g., PESQ, STOI, SI-SDR deltas), error bars, statistical tests, or explicit data-split details. These are required to substantiate the claim that the strategy 'consistently and effectively improves' the student model across datasets.

Authors: The full manuscript (Section 4 and associated tables) reports concrete quantitative results on both single- and multi-channel datasets, including PESQ, STOI and SI-SDR values with direct comparisons to the teacher, the student without distillation, and competing distillation schemes. Data splits follow the standard partitions of the respective benchmarks (e.g., L3DAS23 and other public SE corpora). To make these findings immediately visible, we will incorporate representative numerical deltas into the abstract and ensure error bars together with any statistical significance statements are explicitly stated or added in the results section.

revision: partial

Circularity Check

0 steps flagged

No circularity: empirical gains from proposed TF-calibrated KD rest on dataset comparisons, not self-referential definitions or fits

The paper describes an I²SRF-TFCKD framework that combines intra-set/inter-set recursive fusion with dual-stream time-frequency cross-calibration for distilling a student model from a teacher on speech enhancement tasks. The central claim of consistent improvement over other KD schemes is presented as the outcome of objective evaluations on single- and multi-channel datasets, with no equations, derivations, or parameter-fitting steps shown that would make the reported gains equivalent to the method's own inputs by construction. The described mechanisms (pairwise multi-layer matching, recursive fusion, and cross-weighting of similarity calibration weights) are architectural choices whose contribution is asserted via external benchmark comparisons rather than internal redefinition or self-citation chains. This is a standard empirical ML paper whose validity hinges on reproducibility of the experiments, not on any load-bearing step that collapses to tautology.

Axiom & Free-Parameter Ledger

1 free parameter · 0 axioms · 0 invented entities

The method introduces multiple design choices whose effectiveness is not derived from first principles but asserted through empirical results; without the full text the exact count of free parameters remains unknown.

free parameters (1)

layer-wise matching weights and calibration hyperparameters
The multi-layer interactive distillation and cross-weighting require choices of similarity metrics and weighting functions that are fitted or tuned during training.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tag: unclear
Paper passage: "multi-layer interactive distillation based on dual-stream time-frequency cross-calibration, which calculates the teacher-student similarity calibration weights in the time and frequency domains respectively and performs cross-weighting"

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · embed_add · tag: unclear
Paper passage: "recursive fusion to form the fused feature set that enables inter-set knowledge interaction"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

52 extracted references · 52 canonical work pages · 2 internal anchors

[1]

Sixty years of frequency-domain monaural speech enhancement: From traditional to deep learning methods,

C. Zheng, H. Zhang, W. Liu, X. Luo, A. Li, X. Li, and B. C. Moore, “Sixty years of frequency-domain monaural speech enhancement: From traditional to deep learning methods,” Trends in Hearing , vol. 27, pp. 1–52, 2023

work page 2023
[2]

Multiple statis- tical models for soft decision in noisy speech enhancement,

J.-H. Chang, S. Gazor, N. S. Kim, and S. K. Mitra, “Multiple statis- tical models for soft decision in noisy speech enhancement,” Pattern Recognition, vol. 40, no. 3, pp. 1123–1134, 2007

work page 2007
[3]

A regression approach to speech enhancement based on deep neural networks,

Y . Xu, J. Du, L.-R. Dai, and C.-H. Lee, “A regression approach to speech enhancement based on deep neural networks,” IEEE/ACM transactions on audio, speech, and language processing , vol. 23, no. 1, pp. 7–19, 2014

work page 2014
[4]

The interspeech 2020 deep noise suppression challenge: Datasets, subjective testing framework, and challenge results,

C. K. Reddy, V . Gopal, R. Cutler, E. Beyrami, R. Cheng, H. Dubey, S. Matusevych, R. Aichner, A. Aazami, S. Braun, P. Rana, S. Srinivasan, and J. Gehrke, “The interspeech 2020 deep noise suppression challenge: Datasets, subjective testing framework, and challenge results,” in Inter- speech 2020 , 2020, pp. 2492–2496

work page 2020
[5]

Fullsubnet: A full-band and sub- band fusion model for real-time single-channel speech enhancement,

X. Hao, X. Su, R. Horaud, and X. Li, “Fullsubnet: A full-band and sub- band fusion model for real-time single-channel speech enhancement,” in ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) . IEEE, 2021, pp. 6633–6637. JOURNAL OF LATEX CLASS FILES, VOL. XX, NO. X, XXXX XXXX 13

work page 2021
[6]

Frcrn: Boosting feature representation using frequency recurrence for monaural speech enhancement,

S. Zhao, B. Ma, K. N. Watcharasupat, and W.-S. Gan, “Frcrn: Boosting feature representation using frequency recurrence for monaural speech enhancement,” in ICASSP 2022-2022 IEEE international conference on acoustics, speech and signal processing (ICASSP) . IEEE, 2022, pp. 9281–9285

work page 2022
[7]

Dynamical channel pruning by conditional accuracy change for deep neural networks,

Z. Chen, T.-B. Xu, C. Du, C.-L. Liu, and H. He, “Dynamical channel pruning by conditional accuracy change for deep neural networks,”IEEE Transactions on Neural Networks and Learning Systems , vol. 32, no. 2, pp. 799–813, 2021

work page 2021
[8]

General bitwidth assignment for efficient deep convolutional neural network quantization,

W. Fei, W. Dai, C. Li, J. Zou, and H. Xiong, “General bitwidth assignment for efficient deep convolutional neural network quantization,” IEEE Transactions on Neural Networks and Learning Systems , vol. 33, no. 10, pp. 5253–5267, 2022

work page 2022
[9]

Col- laborative knowledge distillation via multiknowledge transfer,

J. Gou, L. Sun, B. Yu, L. Du, K. Ramamohanarao, and D. Tao, “Col- laborative knowledge distillation via multiknowledge transfer,” IEEE Transactions on Neural Networks and Learning Systems , vol. 35, no. 5, pp. 6718–6730, 2024

work page 2024
[10]

Weight, block or unit? exploring sparsity tradeoffs for speech enhancement on tiny neural accelerators,

M. Stamenovic, N. L. Westhausen, L.-C. Yang, C. Jensen, and A. Pawlicki, “Weight, block or unit? exploring sparsity tradeoffs for speech enhancement on tiny neural accelerators,” arXiv preprint arXiv:2111.02351, 2021

work page arXiv 2021
[11]

Towards model compression for deep learning based speech enhancement,

K. Tan and D. Wang, “Towards model compression for deep learning based speech enhancement,” IEEE/ACM transactions on audio, speech, and language processing , vol. 29, pp. 1785–1794, 2021

work page 2021
[12]

Test-time adaptation toward personalized speech enhancement: Zero-shot learning with knowledge distillation,

S. Kim and M. Kim, “Test-time adaptation toward personalized speech enhancement: Zero-shot learning with knowledge distillation,” in 2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA). IEEE, 2021, pp. 176–180

work page 2021
[13]

Fast real-time personalized speech enhancement: End-to-end enhancement network (e3net) and knowledge distillation,

M. Thakker, S. E. Eskimez, T. Yoshioka, and H. Wang, “Fast real-time personalized speech enhancement: End-to-end enhancement network (e3net) and knowledge distillation,” in Interspeech 2022, 2022, pp. 991– 995

work page 2022
[14]

Abc-kd: Attention- based-compression knowledge distillation for deep learning-based noise suppression,

Y . Wan, Y . Zhou, X. Peng, K.-W. Chang, and Y . Lu, “Abc-kd: Attention- based-compression knowledge distillation for deep learning-based noise suppression,” in Interspeech 2023 , 2023, pp. 2528–2532

work page 2023
[15]

Two-step knowledge distillation for tiny speech enhancement,

R. D. Nathoo, M. Kegler, and M. Stamenovic, “Two-step knowledge distillation for tiny speech enhancement,” in ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2024, pp. 10 141–10 145

work page 2024
[16]

Residual fusion probabilistic knowledge distillation for speech en- hancement,

J. Cheng, R. Liang, L. Zhou, L. Zhao, C. Huang, and B. W. Schuller, “Residual fusion probabilistic knowledge distillation for speech en- hancement,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 32, pp. 2680–2691, 2024

work page 2024
[17]

Dual-path dilated convolutional recurrent network with group attention for multi-channel speech enhancement,

J. Cheng, C. Pang, R. Liang, J. Fan, and L. Zhao, “Dual-path dilated convolutional recurrent network with group attention for multi-channel speech enhancement,” in ICASSP 2023-2023 IEEE International Con- ference on Acoustics, Speech and Signal Processing (ICASSP) . IEEE, 2023, pp. 1–2

work page 2023
[18]

Distilling the Knowledge in a Neural Network

G. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a neural network,” arXiv preprint arXiv:1503.02531 , 2015

work page internal anchor Pith review Pith/arXiv arXiv 2015
[19]

Learning an inference- accelerated network from a pre-trained model with frequency-enhanced feature distillation,

X. Niu, J. Gu, G. Zhang, P. Wan, and Z. Wang, “Learning an inference- accelerated network from a pre-trained model with frequency-enhanced feature distillation,” in Proceedings of the 30th ACM International Conference on Multimedia , 2022, pp. 1847–1856

work page 2022
[20]

Cross- image relational knowledge distillation for semantic segmentation,

C. Yang, H. Zhou, Z. An, X. Jiang, Y . Xu, and Q. Zhang, “Cross- image relational knowledge distillation for semantic segmentation,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 12 319–12 328

work page 2022
[21]

Distilling knowledge via knowledge review,

P. Chen, S. Liu, H. Zhao, and J. Jia, “Distilling knowledge via knowledge review,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition , 2021, pp. 5008–5017

work page 2021
[22]

Inherit with distillation and evolve with contrast: Exploring class incremental semantic segmentation without exemplar memory,

D. Zhao, B. Yuan, and Z. Shi, “Inherit with distillation and evolve with contrast: Exploring class incremental semantic segmentation without exemplar memory,”IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 10, pp. 11 932–11 947, 2023

work page 2023
[23]

Knowledge distillation from transformers for low-complexity acoustic scene clas- sification

F. Schmid, S. Masoudian, K. Koutini, and G. Widmer, “Knowledge distillation from transformers for low-complexity acoustic scene clas- sification.” in DCASE, 2022

work page 2022
[24]

Audio-visual representa- tion learning via knowledge distillation from speech foundation models,

J.-X. Zhang, G. Wan, J. Gao, and Z.-H. Ling, “Audio-visual representa- tion learning via knowledge distillation from speech foundation models,” Pattern Recognition, p. 111432, 2025

work page 2025
[25]

Sub-band knowledge distillation framework for speech enhancement,

X. Hao, S. Wen, X. Su, Y . Liu, G. Gao, and X. Li, “Sub-band knowledge distillation framework for speech enhancement,” in Interspeech 2020 , 2020, pp. 2687–2691

work page 2020
[26]

Cross-layer similarity knowledge distillation for speech enhancement

J. Cheng, R. Liang, Y . Xie, L. Zhao, B. Schuller, J. Jia, and Y . Peng, “Cross-layer similarity knowledge distillation for speech enhancement.” in INTERSPEECH, 2022, pp. 926–930

work page 2022
[27]

Multi-view attention transfer for efficient speech enhancement,

W. Shin, H. J. Park, J. S. Kim, B. H. Lee, and S. W. Han, “Multi-view attention transfer for efficient speech enhancement,” inInterspeech 2022, 2022, pp. 1198–1202

work page 2022
[28]

Dynamic frequency-adaptive knowledge distillation for speech enhancement,

X. Yuan, S. Liu, H. Chen, L. Zhou, J. Li, and J. Hu, “Dynamic frequency-adaptive knowledge distillation for speech enhancement,” in ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , 2025, pp. 1–5

work page 2025
[29]

Distil-dccrn: A small- footprint dccrn leveraging feature-based knowledge distillation in speech enhancement,

R. Han, W. Xu, Z. Zhang, M. Liu, and L. Xie, “Distil-dccrn: A small- footprint dccrn leveraging feature-based knowledge distillation in speech enhancement,” IEEE Signal Processing Letters , 2024

work page 2024
[30]

Semckd: Semantic calibration for cross-layer knowledge distillation,

C. Wang, D. Chen, J.-P. Mei, Y . Zhang, Y . Feng, and C. Chen, “Semckd: Semantic calibration for cross-layer knowledge distillation,” IEEE Transactions on Knowledge and Data Engineering , vol. 35, no. 6, pp. 6305–6319, 2022

work page 2022
[31]

Real time speech enhancement in the waveform domain,

A. D ´efossez, G. Synnaeve, and Y . Adi, “Real time speech enhancement in the waveform domain,” in Interspeech 2020 , 2020, pp. 3291–3295

work page 2020
[32]

Teacher-student learn- ing for low-latency online speech enhancement using wave-u-net,

S. Nakaoka, L. Li, S. Inoue, and S. Makino, “Teacher-student learn- ing for low-latency online speech enhancement using wave-u-net,” in ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) . IEEE, 2021, pp. 661–665

work page 2021
[33]

FitNets: Hints for Thin Deep Nets

A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and Y . Ben- gio, “Fitnets: Hints for thin deep nets,” arXiv preprint arXiv:1412.6550 , 2014

work page internal anchor Pith review Pith/arXiv arXiv 2014
[34]

A hybrid dsp/deep learning approach to real-time full- band speech enhancement,

J.-M. Valin, “A hybrid dsp/deep learning approach to real-time full- band speech enhancement,” in 2018 IEEE 20th international workshop on multimedia signal processing (MMSP) . IEEE, 2018, pp. 1–5

work page 2018
[35]

Weighted speech distortion losses for neural-network-based real-time speech enhancement,

Y . Xia, S. Braun, C. K. Reddy, H. Dubey, R. Cutler, and I. Tashev, “Weighted speech distortion losses for neural-network-based real-time speech enhancement,” in ICASSP 2020-2020 IEEE International Con- ference on Acoustics, Speech and Signal Processing (ICASSP) . IEEE, 2020, pp. 871–875

work page 2020
[36]

Dual-signal transformation lstm network for real-time noise suppression,

N. L. Westhausen and B. T. Meyer, “Dual-signal transformation lstm network for real-time noise suppression,” in Interspeech 2020 , 2020, pp. 2477–2481

work page 2020
[37]

Dccrn: Deep complex convolution recurrent network for phase- aware speech enhancement,

Y . Hu, Y . Liu, S. Lv, M. Xing, S. Zhang, Y . Fu, J. Wu, B. Zhang, and L. Xie, “Dccrn: Deep complex convolution recurrent network for phase- aware speech enhancement,” in Interspeech 2020, 2020, pp. 2472–2476

work page 2020
[38]

Fullsubnet+: Channel attention fullsubnet with complex spectrograms for speech enhancement,

J. Chen, Z. Wang, D. Tuo, Z. Wu, S. Kang, and H. Meng, “Fullsubnet+: Channel attention fullsubnet with complex spectrograms for speech enhancement,” in ICASSP 2022-2022 IEEE international conference on acoustics, speech and signal processing (ICASSP) . IEEE, 2022, pp. 7857–7861

work page 2022
[39]

Two heads are better than one: A two-stage complex spectral mapping approach for monaural speech enhancement,

A. Li, W. Liu, C. Zheng, C. Fan, and X. Li, “Two heads are better than one: A two-stage complex spectral mapping approach for monaural speech enhancement,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 1829–1843, 2021

work page 2021
[40]

Primek-net: Multi-scale spectral learning via group prime-kernel convolutional neural networks for single channel speech enhancement,

Z. Lin, J. Wang, R. Li, F. Shen, and X. Xuan, “Primek-net: Multi-scale spectral learning via group prime-kernel convolutional neural networks for single channel speech enhancement,” in ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2025, pp. 1–5

work page 2025
[41]

A neural beamforming network for b-format 3d speech enhance- ment and recognition,

X. Ren, L. Chen, X. Zheng, C. Xu, X. Zhang, C. Zhang, L. Guo, and B. Yu, “A neural beamforming network for b-format 3d speech enhance- ment and recognition,” in 2021 IEEE 31st International Workshop on Machine Learning for Signal Processing (MLSP) . IEEE, 2021, pp. 1–6

work page 2021
[42] A. Li, W. Liu, C. Zheng, and X. Li, “Embedding and beamforming: All-neural causal beamformer for multichannel speech enhancement,” in ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022, pp. 6487–6491.
[43] H. Wang, Y. Fu, J. Li, M. Ge, L. Wang, and X. Qian, “Stream attention based u-net for l3das23 challenge,” in ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–2.
[44] D. Lee and J.-W. Choi, “Deft-an: Dense frequency-time attentive network for multichannel speech enhancement,” IEEE Signal Processing Letters, vol. 30, pp. 155–159, 2023.
[45] C. Quan and X. Li, “Spatialnet: Extensively learning spatial information for multichannel joint speech separation, denoising and dereverberation,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 32, pp. 1310–1323, 2024.
[46] A. W. Rix, J. G. Beerends, M. P. Hollier, and A. P. Hekstra, “Perceptual evaluation of speech quality (pesq)-a new method for speech quality assessment of telephone networks and codecs,” in 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No. 01CH37221), vol. 2. IEEE, 2001, pp. 749–752.
[47] C. H. Taal, R. C. Hendriks, R. Heusdens, and J. Jensen, “An algorithm for intelligibility prediction of time–frequency weighted noisy speech,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 7, pp. 2125–2136, 2011.
[48] J. Le Roux, S. Wisdom, H. Erdogan, and J. R. Hershey, “Sdr–half-baked or well done?” in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 626–630.
[49] Y. Hu and P. C. Loizou, “Evaluation of objective quality measures for speech enhancement,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 16, no. 1, pp. 229–238, 2007.
[50] A. Baevski, Y. Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A framework for self-supervised learning of speech representations,” Advances in Neural Information Processing Systems, vol. 33, pp. 12449–12460, 2020.
[51] R. A. Fisher, “Statistical methods for research workers,” in Breakthroughs in statistics: Methodology and distribution. Springer, 1970, pp. 66–70.

Jiaming Cheng received the PhD degree from Southeast University, Nanjing, China, in 2024. He is currently a Lecturer with the School of Computer Science, Nanjing Audit University, Nanjing, China. His resea...

He is currently a Professor with the School of Computer Science, Nanjing Audit University, Nanjing, China. His research interests include big data technology and artificial intelligence.

Ye Ni received the M.S. degree from Nanjing University, Nanjing, China, in 2022. He is currently working toward a PhD degree from Southeast University, Nanjing, China. ...

[1] C. Zheng, H. Zhang, W. Liu, X. Luo, A. Li, X. Li, and B. C. Moore, “Sixty years of frequency-domain monaural speech enhancement: From traditional to deep learning methods,” Trends in Hearing, vol. 27, pp. 1–52, 2023.

[2] J.-H. Chang, S. Gazor, N. S. Kim, and S. K. Mitra, “Multiple statistical models for soft decision in noisy speech enhancement,” Pattern Recognition, vol. 40, no. 3, pp. 1123–1134, 2007.

[3] Y. Xu, J. Du, L.-R. Dai, and C.-H. Lee, “A regression approach to speech enhancement based on deep neural networks,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 23, no. 1, pp. 7–19, 2014.

[4] C. K. Reddy, V. Gopal, R. Cutler, E. Beyrami, R. Cheng, H. Dubey, S. Matusevych, R. Aichner, A. Aazami, S. Braun, P. Rana, S. Srinivasan, and J. Gehrke, “The interspeech 2020 deep noise suppression challenge: Datasets, subjective testing framework, and challenge results,” in Interspeech 2020, 2020, pp. 2492–2496.

[5] X. Hao, X. Su, R. Horaud, and X. Li, “Fullsubnet: A full-band and sub-band fusion model for real-time single-channel speech enhancement,” in ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021, pp. 6633–6637.

[6] S. Zhao, B. Ma, K. N. Watcharasupat, and W.-S. Gan, “Frcrn: Boosting feature representation using frequency recurrence for monaural speech enhancement,” in ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022, pp. 9281–9285.

[7] Z. Chen, T.-B. Xu, C. Du, C.-L. Liu, and H. He, “Dynamical channel pruning by conditional accuracy change for deep neural networks,” IEEE Transactions on Neural Networks and Learning Systems, vol. 32, no. 2, pp. 799–813, 2021.

[8] W. Fei, W. Dai, C. Li, J. Zou, and H. Xiong, “General bitwidth assignment for efficient deep convolutional neural network quantization,” IEEE Transactions on Neural Networks and Learning Systems, vol. 33, no. 10, pp. 5253–5267, 2022.

[9] J. Gou, L. Sun, B. Yu, L. Du, K. Ramamohanarao, and D. Tao, “Collaborative knowledge distillation via multiknowledge transfer,” IEEE Transactions on Neural Networks and Learning Systems, vol. 35, no. 5, pp. 6718–6730, 2024.

[10] M. Stamenovic, N. L. Westhausen, L.-C. Yang, C. Jensen, and A. Pawlicki, “Weight, block or unit? exploring sparsity tradeoffs for speech enhancement on tiny neural accelerators,” arXiv preprint arXiv:2111.02351, 2021.

[11] K. Tan and D. Wang, “Towards model compression for deep learning based speech enhancement,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 1785–1794, 2021.

[12] S. Kim and M. Kim, “Test-time adaptation toward personalized speech enhancement: Zero-shot learning with knowledge distillation,” in 2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA). IEEE, 2021, pp. 176–180.

[13] M. Thakker, S. E. Eskimez, T. Yoshioka, and H. Wang, “Fast real-time personalized speech enhancement: End-to-end enhancement network (e3net) and knowledge distillation,” in Interspeech 2022, 2022, pp. 991–995.

[14] Y. Wan, Y. Zhou, X. Peng, K.-W. Chang, and Y. Lu, “Abc-kd: Attention-based-compression knowledge distillation for deep learning-based noise suppression,” in Interspeech 2023, 2023, pp. 2528–2532.

[15] R. D. Nathoo, M. Kegler, and M. Stamenovic, “Two-step knowledge distillation for tiny speech enhancement,” in ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2024, pp. 10141–10145.

[16] J. Cheng, R. Liang, L. Zhou, L. Zhao, C. Huang, and B. W. Schuller, “Residual fusion probabilistic knowledge distillation for speech enhancement,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 32, pp. 2680–2691, 2024.

[17] J. Cheng, C. Pang, R. Liang, J. Fan, and L. Zhao, “Dual-path dilated convolutional recurrent network with group attention for multi-channel speech enhancement,” in ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–2.

[18] G. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a neural network,” arXiv preprint arXiv:1503.02531, 2015.

[19] X. Niu, J. Gu, G. Zhang, P. Wan, and Z. Wang, “Learning an inference-accelerated network from a pre-trained model with frequency-enhanced feature distillation,” in Proceedings of the 30th ACM International Conference on Multimedia, 2022, pp. 1847–1856.

[20] C. Yang, H. Zhou, Z. An, X. Jiang, Y. Xu, and Q. Zhang, “Cross-image relational knowledge distillation for semantic segmentation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 12319–12328.

[21] P. Chen, S. Liu, H. Zhao, and J. Jia, “Distilling knowledge via knowledge review,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 5008–5017.

[22] D. Zhao, B. Yuan, and Z. Shi, “Inherit with distillation and evolve with contrast: Exploring class incremental semantic segmentation without exemplar memory,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 10, pp. 11932–11947, 2023.

[23] F. Schmid, S. Masoudian, K. Koutini, and G. Widmer, “Knowledge distillation from transformers for low-complexity acoustic scene classification.” in DCASE, 2022.

[24] J.-X. Zhang, G. Wan, J. Gao, and Z.-H. Ling, “Audio-visual representation learning via knowledge distillation from speech foundation models,” Pattern Recognition, p. 111432, 2025.

[25] X. Hao, S. Wen, X. Su, Y. Liu, G. Gao, and X. Li, “Sub-band knowledge distillation framework for speech enhancement,” in Interspeech 2020, 2020, pp. 2687–2691.

[26] J. Cheng, R. Liang, Y. Xie, L. Zhao, B. Schuller, J. Jia, and Y. Peng, “Cross-layer similarity knowledge distillation for speech enhancement.” in INTERSPEECH, 2022, pp. 926–930.

[27] W. Shin, H. J. Park, J. S. Kim, B. H. Lee, and S. W. Han, “Multi-view attention transfer for efficient speech enhancement,” in Interspeech 2022, 2022, pp. 1198–1202.

[28] X. Yuan, S. Liu, H. Chen, L. Zhou, J. Li, and J. Hu, “Dynamic frequency-adaptive knowledge distillation for speech enhancement,” in ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2025, pp. 1–5.

[29] R. Han, W. Xu, Z. Zhang, M. Liu, and L. Xie, “Distil-dccrn: A small-footprint dccrn leveraging feature-based knowledge distillation in speech enhancement,” IEEE Signal Processing Letters, 2024.

[30] C. Wang, D. Chen, J.-P. Mei, Y. Zhang, Y. Feng, and C. Chen, “Semckd: Semantic calibration for cross-layer knowledge distillation,” IEEE Transactions on Knowledge and Data Engineering, vol. 35, no. 6, pp. 6305–6319, 2022.

[31] A. Défossez, G. Synnaeve, and Y. Adi, “Real time speech enhancement in the waveform domain,” in Interspeech 2020, 2020, pp. 3291–3295.

[32] S. Nakaoka, L. Li, S. Inoue, and S. Makino, “Teacher-student learning for low-latency online speech enhancement using wave-u-net,” in ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021, pp. 661–665.

[33] A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and Y. Bengio, “Fitnets: Hints for thin deep nets,” arXiv preprint arXiv:1412.6550, 2014.

[34] J.-M. Valin, “A hybrid dsp/deep learning approach to real-time full-band speech enhancement,” in 2018 IEEE 20th International Workshop on Multimedia Signal Processing (MMSP). IEEE, 2018, pp. 1–5.

[35] Y. Xia, S. Braun, C. K. Reddy, H. Dubey, R. Cutler, and I. Tashev, “Weighted speech distortion losses for neural-network-based real-time speech enhancement,” in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 871–875.

[36] N. L. Westhausen and B. T. Meyer, “Dual-signal transformation lstm network for real-time noise suppression,” in Interspeech 2020, 2020, pp. 2477–2481.
