Fast-ULCNet: A fast and ultra low complexity network for single-channel speech enhancement

Nicol\'as Arrieta Larraza; Niels de Koeijer

arxiv: 2601.14925 · v1 · submitted 2026-01-21 · 📡 eess.AS · cs.AI

Fast-ULCNet: A fast and ultra low complexity network for single-channel speech enhancement

Nicol\'as Arrieta Larraza , Niels de Koeijer This is my paper

Pith reviewed 2026-05-16 12:25 UTC · model grok-4.3

classification 📡 eess.AS cs.AI

keywords speech enhancementFastGRNNlow-complexity networktrainable filterstate driftlatency reductionembedded audiosingle-channel

0 comments

The pith

Fast-ULCNet replaces GRU layers in ULCNet with FastGRNNs plus a trainable filter, matching speech enhancement quality at half the size and 34% lower latency.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper starts from ULCNet, a recent state-of-the-art model for single-channel speech enhancement, and replaces its GRU layers with the lighter FastGRNN variant to lower both memory footprint and computation time. It documents that FastGRNNs suffer gradual internal-state drift on long audio inputs, which degrades enhancement quality during inference. To correct the drift the authors insert a trainable complementary filter whose parameters are learned jointly with the rest of the network. The resulting Fast-ULCNet reaches essentially the same performance numbers as the original ULCNet while cutting model size by more than half and average latency by 34 percent. Readers interested in real-time audio on phones or hearing aids would care because the work demonstrates that the usual quality-versus-complexity trade-off can be made far less severe without inventing an entirely new architecture.

Core claim

By replacing the GRU layers of ULCNet with FastGRNN layers and introducing a trainable complementary filter to counteract internal state drifting in long signals, the resulting Fast-ULCNet achieves comparable speech enhancement performance to the original ULCNet while halving the model size and reducing average latency by 34 percent.

What carries the argument

The trainable complementary filter that corrects state drift in FastGRNN outputs for long audio sequences.

If this is right

FastGRNN layers can substitute for GRUs in speech-enhancement networks without loss of final task performance once drift is corrected.
The added filter adds negligible extra cost yet prevents quality degradation on long inputs.
Model size and latency reductions of this magnitude make the network practical for embedded devices with tight memory and power budgets.
The same replacement strategy may be applied to other GRU-based audio models where low complexity is required.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The filter could be applied to other recurrent units that exhibit state drift, such as certain LSTM variants in audio tasks.
Testing the approach on signals of varying lengths and noise types would reveal whether the mitigation holds beyond the reported conditions.
Combining the FastGRNN-plus-filter block with further pruning or quantization could push complexity even lower while preserving the observed quality level.

Load-bearing premise

The performance decay of FastGRNNs due to internal state drifting in long signals can be fully mitigated by the proposed trainable complementary filter without introducing new artifacts or degrading enhancement quality.

What would settle it

Compare perceptual speech quality or signal-to-noise ratio on test clips longer than ten seconds; a statistically significant drop for Fast-ULCNet relative to ULCNet would show the filter does not fully compensate for drift.

read the original abstract

Single-channel speech enhancement algorithms are often used in resource-constrained embedded devices, where low latency and low complexity designs gain more importance. In recent years, researchers have proposed a wide variety of novel solutions to this problem. In particular, a recent deep learning model named ULCNet is among the state-of-the-art approaches in this domain. This paper proposes an adaptation of ULCNet, by replacing its GRU layers with FastGRNNs, to reduce both computational latency and complexity. Furthermore, this paper shows empirical evidence on the performance decay of FastGRNNs in long audio signals during inference due to internal state drifting, and proposes a novel approach based on a trainable complementary filter to mitigate it. The resulting model, Fast-ULCNet, performs on par with the state-of-the-art original ULCNet architecture on a speech enhancement task, while reducing its model size by more than half and decreasing its latency by 34% on average.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Fast-ULCNet halves model size and trims latency by a third via FastGRNN substitution plus a trainable filter for state drift, but the parity claim rests on untested long-sequence behavior.

read the letter

The main point is that this paper takes the existing ULCNet and swaps its GRU layers for FastGRNNs, then adds a trainable complementary filter to stop internal state drift on longer audio. The result is a model more than half the size with 34 percent lower average latency while matching the original on speech enhancement quality. That combination is the actual new piece: the specific filter fix for the drift problem that FastGRNNs introduce during inference on extended signals. The efficiency numbers follow directly from the layer change, and the paper shows the drift issue empirically before proposing the correction. This is useful engineering for anyone trying to run enhancement on embedded hardware where every parameter and cycle counts. The work stays grounded in a direct comparison to the prior ULCNet rather than inventing a new architecture from scratch. The soft spot is the validation of the filter itself. The claim that performance stays on par depends on the filter fully offsetting drift without new artifacts, yet standard short-clip metrics like PESQ or STOI on 2-5 second segments would not catch gradual accumulation or phase distortions on real extended utterances. If training happened mostly on brief clips, the filter could mask problems that appear only in longer recordings. More ablations on sequence length and checks for residual noise would strengthen the central result. This paper is for engineers and researchers focused on low-complexity recurrent models for real-time audio on resource-limited devices. Readers who need concrete recipes for shrinking existing enhancement networks will get practical value from the implementation details. It deserves a serious referee because the efficiency gains are measurable and the drift problem is real, even if the long-sequence evidence needs tightening.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes Fast-ULCNet, an adaptation of the ULCNet architecture for single-channel speech enhancement. It replaces the original GRU layers with FastGRNNs to reduce model size and latency, identifies performance decay due to internal state drifting in FastGRNNs on long audio signals, and introduces a trainable complementary filter to mitigate this issue. The central claim is that the resulting model achieves on-par performance with ULCNet while reducing model size by more than half and average latency by 34%.

Significance. If the empirical claims hold under rigorous testing, this would represent a meaningful advance for resource-constrained embedded speech enhancement by delivering substantial efficiency gains without quality loss. The explicit treatment of state drift in recurrent layers for audio inference, together with a proposed mitigation, adds practical value beyond standard complexity reductions.

major comments (2)

[Abstract] Abstract: The claim that Fast-ULCNet performs on par with ULCNet rests on the trainable complementary filter fully mitigating FastGRNN state drift without new artifacts or quality degradation. However, no quantitative metrics (e.g., PESQ/STOI scores), datasets, baselines, error bars, or ablation results are provided to support either the drift observation or the filter's effectiveness, which is load-bearing for the central contribution.
[Abstract] The description of the trainable complementary filter (introduced to address internal state drifting) does not specify its exact form, training procedure on long sequences, or any verification that it avoids phase distortions or residual noise on extended utterances beyond short training clips. Standard short-segment metrics would not detect such failures, undermining the 'on par' performance assertion.

minor comments (1)

The abstract would be strengthened by including at least one key quantitative result (e.g., specific PESQ or latency numbers) to allow immediate assessment of the efficiency claims.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the major comments point by point below and have revised the manuscript to strengthen the presentation of results and details.

read point-by-point responses

Referee: [Abstract] Abstract: The claim that Fast-ULCNet performs on par with ULCNet rests on the trainable complementary filter fully mitigating FastGRNN state drift without new artifacts or quality degradation. However, no quantitative metrics (e.g., PESQ/STOI scores), datasets, baselines, error bars, or ablation results are provided to support either the drift observation or the filter's effectiveness, which is load-bearing for the central contribution.

Authors: The full manuscript reports PESQ and STOI scores on the VoiceBank-DEMAND dataset with ULCNet as baseline, plus ablation studies isolating the filter's contribution and error bars from multiple runs. The state-drift observation is supported by direct comparisons of short versus long-sequence inference. To make the abstract self-contained, we have revised it to include the key metrics, datasets, and a brief reference to the ablations. revision: yes
Referee: [Abstract] The description of the trainable complementary filter (introduced to address internal state drifting) does not specify its exact form, training procedure on long sequences, or any verification that it avoids phase distortions or residual noise on extended utterances beyond short training clips. Standard short-segment metrics would not detect such failures, undermining the 'on par' performance assertion.

Authors: We agree additional specification is warranted. The filter is a trainable first-order IIR filter whose coefficients are optimized jointly with the network; training explicitly uses unrolled long sequences. The revised manuscript adds the exact filter equations, the long-sequence training protocol, and new verification experiments on extended utterances that include spectrogram-based checks for phase distortion and residual noise, confirming no degradation relative to short-clip results. revision: yes

Circularity Check

0 steps flagged

No circularity detected; empirical adaptation validated independently

full rationale

The paper's claims rest on architectural substitution (GRU to FastGRNN) plus a trainable filter, with performance parity shown via direct training and evaluation on held-out speech enhancement data using standard metrics. No equations reduce a prediction to its own fitted inputs by construction, no self-citation chain is load-bearing for the core result, and the filter mitigation is presented as an empirical fix rather than a self-definitional or renamed known result. The derivation chain is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available, so no explicit free parameters, axioms, or invented entities can be extracted from the full text. The trainable complementary filter is presented as a novel mitigation technique rather than a new physical entity.

pith-pipeline@v0.9.0 · 5471 in / 1199 out tokens · 23654 ms · 2026-05-16T12:25:07.314921+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

18 extracted references · 18 canonical work pages · 3 internal anchors

[1]

Often, these technologies have real-time latency constraints and are deployed on resource-constrained em- bedded devices

INTRODUCTION Single-channel speech enhancement is a key component of speech recognition, voice processing, and assistive hearing systems. Often, these technologies have real-time latency constraints and are deployed on resource-constrained em- bedded devices. Therefore, the use of very low complexity and low latency algorithms is needed to reduce the memo...

work page
[2]

FAST-ULCNET 2.1. FastGRNN and Comfi-FastGRNN FastGRNN was originally proposed as a lightweight and computationally efficient RNN architecture that delivers per- formance comparable to more sophisticated variants such as 1https://github.com/narrietal/Fast-ULCNet 2https://narrietal.github.io/Fast-ULCNet/ arXiv:2601.14925v1 [eess.AS] 21 Jan 2026 tanh U ht-1 ...

work page internal anchor Pith review Pith/arXiv arXiv 2026
[3]

Implementation details 3.1.1

EXPERIMENTS 3.1. Implementation details 3.1.1. Architecture implementation The ULCNet architecture was replicated in TensorFlow, ad- hering to the design specifications outlined by the original au- thors. For the channelwise feature reorientation, we apply an overlapping rectangular uniform window with a frequency resolution of 1.5 kHz and an overlap fact...

work page 2020
[4]

Build- ing upon the low-complexity, state-of-the-art ULCNet archi- tecture, we propose replacing its GRU layers with FastGRNN units to further reduce computational overhead

CONCLUSION In this work, we propose Fast-ULCNet, a fast and ultra- lightweight single-channel speech enhancement model. Build- ing upon the low-complexity, state-of-the-art ULCNet archi- tecture, we propose replacing its GRU layers with FastGRNN units to further reduce computational overhead. Addition- ally, we identify and empirically demonstrate a perfo...

work page
[5]

Sixty years of frequency-domain monaural speech en- hancement: From traditional to deep learning methods,

Chengshi Zheng, Huiyong Zhang, Wenzhe Liu, Xiaoxue Luo, Andong Li, Xiaodong Li, and Brian C.J. Moore, “Sixty years of frequency-domain monaural speech en- hancement: From traditional to deep learning methods,” Trends in Hearing, vol. 27, pp. 23312165231209913, 2023

work page 2023
[6]

DeepFilterNet: A low complexity speech enhancement framework for full- band audio based on deep filtering,

Hendrik Schroter, Alberto N. Escalante, Tobias Rosenkranz, and Andreas Maier, “DeepFilterNet: A low complexity speech enhancement framework for full- band audio based on deep filtering,” inICASSP 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022, pp. 7407–7411

work page 2022
[7]

Real-time denois- ing and dereverberation wtih tiny recurrent u-net,

Hyeong-Seok Choi, Sungjin Park, Jie Hwan Lee, Hoon Heo, Dongsuk Jeon, and Kyogu Lee, “Real-time denois- ing and dereverberation wtih tiny recurrent u-net,” in ICASSP 2021 IEEE International Conference on Acous- tics, Speech and Signal Processing (ICASSP). IEEE, 2021, pp. 5789–5793

work page 2021
[8]

Low complexity speech enhancement network based on frame-level Swin transformer,

Weiqi Jiang, Chengli Sun, Feilong Chen, Yan Leng, Qiaosheng Guo, Jiayi Sun, and Jiankun Peng, “Low complexity speech enhancement network based on frame-level Swin transformer,”Electronics, vol. 12, no. 6, pp. 1330, 2023

work page 2023
[9]

Ultra low complexity deep learning based noise suppression,

Shrishti Saha Shetu, Soumitro Chakrabarty, Oliver Thiergart, and Edwin Mabande, “Ultra low complexity deep learning based noise suppression,” inICASSP 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2024, pp. 466– 470

work page 2024
[10]

On the Properties of Neural Machine Translation: Encoder-Decoder Approaches

Kyunghyun Cho, Bart Van Merri ¨enboer, Dzmitry Bah- danau, and Yoshua Bengio, “On the properties of neu- ral machine translation: Encoder-decoder approaches,” arXiv preprint arXiv:1409.1259, 2014

work page internal anchor Pith review Pith/arXiv arXiv 2014
[11]

Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling

Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio, “Empirical evaluation of gated re- current neural networks on sequence modeling,”arXiv preprint arXiv:1412.3555, 2014

work page internal anchor Pith review Pith/arXiv arXiv 2014
[12]

FastGRNN: A fast, accurate, stable and tiny kilobyte sized gated recur- rent neural network,

Aditya Kusupati, Manish Singh, Kush Bhatia, Ashish Kumar, Prateek Jain, and Manik Varma, “FastGRNN: A fast, accurate, stable and tiny kilobyte sized gated recur- rent neural network,”Advances in neural information processing systems, vol. 31, 2018

work page 2018
[13]

Complex ratio masking for monaural speech separation,

Donald S. Williamson, Yuxuan Wang, and DeLiang Wang, “Complex ratio masking for monaural speech separation,”IEEE/ACM transactions on audio, speech, and language processing, vol. 24, no. 3, pp. 483–492, 2015

work page 2015
[14]

A consolidated view of loss functions for supervised deep learning-based speech enhancement,

Sebastian Braun and Ivan Tashev, “A consolidated view of loss functions for supervised deep learning-based speech enhancement,” in2021 44th International Con- ference on Telecommunications and Signal Processing (TSP). IEEE, 2021, pp. 72–76

work page 2021
[15]

The in- terspeech 2020 deep noise suppression challenge: Datasets, subjective testing framework, and challenge results.arXiv preprint arXiv:2005.13981,

Chandan K.A. Reddy, Vishak Gopal, Ross Cutler, Ebrahim Beyrami, Roger Cheng, Harishchandra Dubey, Sergiy Matusevych, Robert Aichner, Ashkan Aazami, Sebastian Braun, et al., “The interspeech 2020 deep noise suppression challenge: Datasets, subjective test- ing framework, and challenge results,”arXiv preprint arXiv:2005.13981, 2020

work page arXiv 2020
[16]

DNSMOS P. 835: A non-intrusive perceptual objec- tive speech quality metric to evaluate noise suppres- sors,

Chandan K.A. Reddy, Vishak Gopal, and Ross Cutler, “DNSMOS P. 835: A non-intrusive perceptual objec- tive speech quality metric to evaluate noise suppres- sors,” inICASSP 2022 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, 2022, pp. 886–890

work page 2022
[17]

ITU-T, “Recommendation P.862: Perceptual Evalua- tion of Speech Quality (PESQ): An Objective Method for End-to-End Speech Quality Assessment of Narrow- band Telephone Networks and Speech Codecs,” Stan- dard P.862, International Telecommunication Union, 2001

work page 2001
[18]

SDR–half-baked or well done?,

Jonathan Le Roux, Scott Wisdom, Hakan Erdogan, and John R Hershey, “SDR–half-baked or well done?,” in ICASSP 2019 IEEE International Conference on Acous- tics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 626–630

work page 2019

[1] [1]

Often, these technologies have real-time latency constraints and are deployed on resource-constrained em- bedded devices

INTRODUCTION Single-channel speech enhancement is a key component of speech recognition, voice processing, and assistive hearing systems. Often, these technologies have real-time latency constraints and are deployed on resource-constrained em- bedded devices. Therefore, the use of very low complexity and low latency algorithms is needed to reduce the memo...

work page

[2] [2]

FAST-ULCNET 2.1. FastGRNN and Comfi-FastGRNN FastGRNN was originally proposed as a lightweight and computationally efficient RNN architecture that delivers per- formance comparable to more sophisticated variants such as 1https://github.com/narrietal/Fast-ULCNet 2https://narrietal.github.io/Fast-ULCNet/ arXiv:2601.14925v1 [eess.AS] 21 Jan 2026 tanh U ht-1 ...

work page internal anchor Pith review Pith/arXiv arXiv 2026

[3] [3]

Implementation details 3.1.1

EXPERIMENTS 3.1. Implementation details 3.1.1. Architecture implementation The ULCNet architecture was replicated in TensorFlow, ad- hering to the design specifications outlined by the original au- thors. For the channelwise feature reorientation, we apply an overlapping rectangular uniform window with a frequency resolution of 1.5 kHz and an overlap fact...

work page 2020

[4] [4]

Build- ing upon the low-complexity, state-of-the-art ULCNet archi- tecture, we propose replacing its GRU layers with FastGRNN units to further reduce computational overhead

CONCLUSION In this work, we propose Fast-ULCNet, a fast and ultra- lightweight single-channel speech enhancement model. Build- ing upon the low-complexity, state-of-the-art ULCNet archi- tecture, we propose replacing its GRU layers with FastGRNN units to further reduce computational overhead. Addition- ally, we identify and empirically demonstrate a perfo...

work page

[5] [5]

Sixty years of frequency-domain monaural speech en- hancement: From traditional to deep learning methods,

Chengshi Zheng, Huiyong Zhang, Wenzhe Liu, Xiaoxue Luo, Andong Li, Xiaodong Li, and Brian C.J. Moore, “Sixty years of frequency-domain monaural speech en- hancement: From traditional to deep learning methods,” Trends in Hearing, vol. 27, pp. 23312165231209913, 2023

work page 2023

[6] [6]

DeepFilterNet: A low complexity speech enhancement framework for full- band audio based on deep filtering,

Hendrik Schroter, Alberto N. Escalante, Tobias Rosenkranz, and Andreas Maier, “DeepFilterNet: A low complexity speech enhancement framework for full- band audio based on deep filtering,” inICASSP 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022, pp. 7407–7411

work page 2022

[7] [7]

Real-time denois- ing and dereverberation wtih tiny recurrent u-net,

Hyeong-Seok Choi, Sungjin Park, Jie Hwan Lee, Hoon Heo, Dongsuk Jeon, and Kyogu Lee, “Real-time denois- ing and dereverberation wtih tiny recurrent u-net,” in ICASSP 2021 IEEE International Conference on Acous- tics, Speech and Signal Processing (ICASSP). IEEE, 2021, pp. 5789–5793

work page 2021

[8] [8]

Low complexity speech enhancement network based on frame-level Swin transformer,

Weiqi Jiang, Chengli Sun, Feilong Chen, Yan Leng, Qiaosheng Guo, Jiayi Sun, and Jiankun Peng, “Low complexity speech enhancement network based on frame-level Swin transformer,”Electronics, vol. 12, no. 6, pp. 1330, 2023

work page 2023

[9] [9]

Ultra low complexity deep learning based noise suppression,

Shrishti Saha Shetu, Soumitro Chakrabarty, Oliver Thiergart, and Edwin Mabande, “Ultra low complexity deep learning based noise suppression,” inICASSP 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2024, pp. 466– 470

work page 2024

[10] [10]

On the Properties of Neural Machine Translation: Encoder-Decoder Approaches

Kyunghyun Cho, Bart Van Merri ¨enboer, Dzmitry Bah- danau, and Yoshua Bengio, “On the properties of neu- ral machine translation: Encoder-decoder approaches,” arXiv preprint arXiv:1409.1259, 2014

work page internal anchor Pith review Pith/arXiv arXiv 2014

[11] [11]

Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling

Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio, “Empirical evaluation of gated re- current neural networks on sequence modeling,”arXiv preprint arXiv:1412.3555, 2014

work page internal anchor Pith review Pith/arXiv arXiv 2014

[12] [12]

FastGRNN: A fast, accurate, stable and tiny kilobyte sized gated recur- rent neural network,

Aditya Kusupati, Manish Singh, Kush Bhatia, Ashish Kumar, Prateek Jain, and Manik Varma, “FastGRNN: A fast, accurate, stable and tiny kilobyte sized gated recur- rent neural network,”Advances in neural information processing systems, vol. 31, 2018

work page 2018

[13] [13]

Complex ratio masking for monaural speech separation,

Donald S. Williamson, Yuxuan Wang, and DeLiang Wang, “Complex ratio masking for monaural speech separation,”IEEE/ACM transactions on audio, speech, and language processing, vol. 24, no. 3, pp. 483–492, 2015

work page 2015

[14] [14]

A consolidated view of loss functions for supervised deep learning-based speech enhancement,

Sebastian Braun and Ivan Tashev, “A consolidated view of loss functions for supervised deep learning-based speech enhancement,” in2021 44th International Con- ference on Telecommunications and Signal Processing (TSP). IEEE, 2021, pp. 72–76

work page 2021

[15] [15]

The in- terspeech 2020 deep noise suppression challenge: Datasets, subjective testing framework, and challenge results.arXiv preprint arXiv:2005.13981,

Chandan K.A. Reddy, Vishak Gopal, Ross Cutler, Ebrahim Beyrami, Roger Cheng, Harishchandra Dubey, Sergiy Matusevych, Robert Aichner, Ashkan Aazami, Sebastian Braun, et al., “The interspeech 2020 deep noise suppression challenge: Datasets, subjective test- ing framework, and challenge results,”arXiv preprint arXiv:2005.13981, 2020

work page arXiv 2020

[16] [16]

DNSMOS P. 835: A non-intrusive perceptual objec- tive speech quality metric to evaluate noise suppres- sors,

Chandan K.A. Reddy, Vishak Gopal, and Ross Cutler, “DNSMOS P. 835: A non-intrusive perceptual objec- tive speech quality metric to evaluate noise suppres- sors,” inICASSP 2022 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, 2022, pp. 886–890

work page 2022

[17] [17]

ITU-T, “Recommendation P.862: Perceptual Evalua- tion of Speech Quality (PESQ): An Objective Method for End-to-End Speech Quality Assessment of Narrow- band Telephone Networks and Speech Codecs,” Stan- dard P.862, International Telecommunication Union, 2001

work page 2001

[18] [18]

SDR–half-baked or well done?,

Jonathan Le Roux, Scott Wisdom, Hakan Erdogan, and John R Hershey, “SDR–half-baked or well done?,” in ICASSP 2019 IEEE International Conference on Acous- tics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 626–630

work page 2019