pith. machine review for the scientific record

arxiv: 2605.08189 · v1 · submitted 2026-05-05 · 📡 eess.AS

Recognition: 2 theorem links


DiffVQE: Hybrid Diffusion Voice Quality Enhancement Under Acoustic Echo and Noise

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 01:24 UTC · model grok-4.3

classification 📡 eess.AS
keywords: acoustic echo cancellation · speech enhancement · diffusion models · denoising · voice quality enhancement · generative models

The pith

DiffVQE is presented as the first reproducible (non-causal) diffusion-based model for joint acoustic echo control and speech denoising, and it beats the leading discriminative method, DeepVQE, in both quality and efficiency.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents DiffVQE as a diffusion-based system for enhancing voice quality by suppressing acoustic echo and background noise in hands-free setups. It claims this is the first such non-causal diffusion approach that is fully specified in topology, data, and training. Trained on the diverse Interspeech 2025 URGENT Challenge dataset, DiffVQE is reported to surpass Microsoft's DeepVQE in echo and noise removal while using less computation and a smaller model. A sympathetic reader would care because generative diffusion techniques have already lifted other speech tasks, so applying them here could shift how real-world audio systems handle common degradations.

Core claim

The central claim is that a hybrid diffusion model trained on the URGENT Challenge dataset delivers better joint acoustic echo control and denoising than the prior leading discriminative model DeepVQE, while also reducing computational complexity and model size.

What carries the argument

The hybrid diffusion process that learns to generate clean speech from inputs degraded by echo and noise.
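As a sketch of that machinery: score-based diffusion models generate clean speech by integrating a reverse-time SDE whose drift uses a (normally learned) score network. The toy below is not the paper's model; it is a one-dimensional illustration with an analytic score standing in for the Score DNN and an assumed variance-exploding noise schedule, showing only the reverse sampling loop such systems run at inference.

```python
import math, random

# Toy variance-exploding (VE) diffusion in one dimension. The clean value
# x0 plays the role of a clean-speech coefficient; the learned score
# network is replaced by the analytic score of p_t(x) = N(x0, sigma(t)^2).
SIGMA_MIN, SIGMA_MAX = 0.01, 1.0  # illustrative schedule (assumed values)

def sigma(t):
    return SIGMA_MIN * (SIGMA_MAX / SIGMA_MIN) ** t

def analytic_score(x, t, x0):
    # grad_x log N(x; x0, sigma(t)^2)
    return (x0 - x) / sigma(t) ** 2

def reverse_sample(x0, steps=200, seed=0):
    """Euler-Maruyama integration of the reverse-time SDE (Anderson, 1982)."""
    rng = random.Random(seed)
    x = sigma(1.0) * rng.gauss(0.0, 1.0)  # draw from the prior at t = 1
    dt = 1.0 / steps
    for i in range(steps, 0, -1):
        t = i * dt
        # VE SDE diffusion coefficient: g(t)^2 = d(sigma^2(t))/dt
        g2 = 2.0 * sigma(t) ** 2 * math.log(SIGMA_MAX / SIGMA_MIN)
        x = x + g2 * analytic_score(x, t, x0) * dt \
              + math.sqrt(g2 * dt) * rng.gauss(0.0, 1.0)
    return x

print(abs(reverse_sample(x0=0.7) - 0.7) < 0.2)  # sample lands near x0
```

In a DiffVQE-style system, the analytic score would be replaced by a Score DNN conditioned on the degraded microphone and far-end signals, operating on STFT coefficients rather than scalars.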

If this is right

  • DiffVQE achieves stronger echo cancellation and denoising than DeepVQE on the chosen dataset.
  • The diffusion model requires lower computational complexity than the baseline.
  • Model size is reduced while maintaining or improving performance.
  • A reproducible diffusion baseline is established for acoustic echo cancellation tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Making the model causal could open the door to real-time deployment in live communication devices.
  • The same diffusion framework might extend to other combined audio degradations such as reverberation or packet loss.
  • Success here suggests diffusion models could become competitive defaults for joint enhancement problems rather than separate modules.

Load-bearing premise

Training a diffusion model on the URGENT Challenge dataset will reliably deliver superior joint echo and noise performance compared to strong discriminative baselines like DeepVQE.

What would settle it

Evaluating both DiffVQE and DeepVQE on the same URGENT Challenge test set and finding no gain in objective metrics such as echo return loss enhancement or perceptual speech quality scores.
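For concreteness, the echo return loss enhancement (ERLE) named above is conventionally the dB ratio of echo power before and after processing. A minimal sketch of the metric (the function name and the eps guard are ours, not the paper's):

```python
import math

def erle_db(echo, residual, eps=1e-12):
    """ERLE = 10 * log10(P_echo / P_residual): how many dB of echo energy
    the enhancement removed. Higher is better; leaving the echo untouched
    scores 0 dB. eps guards against a log of zero for silent signals."""
    p_echo = sum(v * v for v in echo) / len(echo)
    p_residual = sum(v * v for v in residual) / len(residual)
    return 10.0 * math.log10((p_echo + eps) / (p_residual + eps))

echo = [1.0, -1.0] * 50
print(round(erle_db(echo, [0.1 * v for v in echo]), 3))  # 20.0 dB for a 10x amplitude cut
```

Perceptual quality scores such as PESQ work differently (intrusive comparison against a clean reference), which is why challenge evaluations report both kinds of metric side by side.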

Figures

Figures reproduced from arXiv: 2605.08189 by Ernst Seidel, Haljan Lugo Girao, Pejman Mowlaee, Tim Fingscheidt, Ziyue Zhao.

Figure 1
Figure 1: Overview of the end-to-end hands-free system using a hybrid diffusion approach. Cond and Score networks are the discriminative and generative networks as utilized in [5]. The surrounding text gives the microphone signal as y(n) = s′(n) + d(n) + n(n), with both microphone and far-end speech used as inputs to the hybrid diffusion model. view at source ↗
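The signal model around Figure 1 can be made concrete in a few lines. The sketch below is our own toy, not the paper's code: tanh stands in for the unspecified loudspeaker nonlinearity f_NL, a three-tap echo path plays the room impulse response h1, and white near-end noise is added, assembling y(n) = s′(n) + d(n) + n(n).

```python
import math, random

def simulate_microphone(far_end, near_end, h1, noise_std=0.05, seed=0):
    """Toy hands-free signal model:
    x'(n) = f_NL(x(n))           loudspeaker nonlinearity (tanh assumed here)
    d(n)  = (h1 * x')(n)         echo via room impulse response h1
    y(n)  = s'(n) + d(n) + n(n)  microphone signal"""
    rng = random.Random(seed)
    x_nl = [math.tanh(x) for x in far_end]
    d = [sum(h1[k] * x_nl[n - k] for k in range(min(len(h1), n + 1)))
         for n in range(len(x_nl))]
    return [near_end[n] + d[n] + rng.gauss(0.0, noise_std)
            for n in range(len(x_nl))]

far_end = [math.sin(0.3 * n) for n in range(64)]
near_end = [0.5 * math.sin(0.05 * n) for n in range(64)]
mic = simulate_microphone(far_end, near_end, h1=[0.6, 0.3, 0.1])
print(len(mic))  # 64: one microphone sample per input sample
```

Recovering s′(n) from y(n) given the far-end reference x(n) is exactly the joint AEC-plus-denoising task DiffVQE addresses.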
Figure 2
Figure 2: Cond and Score DNN topology. view at source ↗
Figure 4
Figure 4: Dval performance dependency on SER in DT. The surrounding text reports intrusive metrics for quality (PESQ) and intelligibility (LPS, ESTOI) to assess near-end speech degradation in a controlled manner, along with the number of parameters, FLOPS, and RTF (measured on a single thread of an AMD EPYC 9575F CPU @ 3.3 GHz), plus the average rank among the three compared methods over all DT, STFE, … view at source ↗
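Among the efficiency numbers in Figure 4, the real-time factor (RTF) is simply wall-clock processing time divided by audio duration; RTF < 1 means faster than real time. A minimal, generic measurement helper (ours, not the paper's benchmark code):

```python
import time

def real_time_factor(enhance, audio_seconds):
    """RTF = processing time / audio duration, measured around one call to
    the enhancement function (Fig. 4 reports this on a single CPU thread)."""
    t0 = time.perf_counter()
    enhance()
    return (time.perf_counter() - t0) / audio_seconds

# A no-op stands in for a model's forward pass over 10 s of audio.
rtf = real_time_factor(lambda: None, audio_seconds=10.0)
print(rtf < 1.0)  # a no-op is trivially faster than real time
```

Note that for the non-causal DiffVQE, a low RTF alone does not imply real-time usability, since the model may need future context before producing output.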
read the original abstract

Acoustic echo and background noise pose challenges on speech enhancement in hands-free systems and speakerphones. Discriminatively trained end-to-end methods represent a powerful solution for joint acoustic echo control (AEC) and denoising. However, with the advent of generative methods, diffusion-based approaches have seen remarkable performance in speech enhancement tasks. In this work, to the best of our knowledge, we provide the first (still non-causal) diffusion-based AEC model (DiffVQE) that is reproducible in terms of topology, training data, and training framework. So far, without employing diffusion, Microsoft's discriminative DeepVQE model has been shown to excel any of the ICASSP 2023 AEC Challenge entries achieving remarkable performance. Using data from the Interspeech 2025 URGENT Challenge for a diverse, high-quality training dataset, our DiffVQE excels DeepVQE both in echo and noise control performance, as well as in computational complexity and model size.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript introduces DiffVQE as the first reproducible (non-causal) diffusion-based model for joint acoustic echo control (AEC) and denoising. It claims that, when trained on the Interspeech 2025 URGENT Challenge dataset, DiffVQE outperforms Microsoft's earlier discriminative DeepVQE model in echo/noise control performance while also improving computational complexity and model size.

Significance. If the performance claims are substantiated with matched-data controls and quantitative results, the work would be significant for demonstrating that hybrid diffusion models can be applied effectively to AEC tasks, potentially yielding smaller and more efficient solutions than purely discriminative approaches. The explicit emphasis on reproducibility of topology, data, and framework is a clear strength.

major comments (1)
  1. [Abstract] Abstract: The central claim that DiffVQE 'excels DeepVQE both in echo and noise control performance, as well as in computational complexity and model size' is load-bearing for the paper's contribution, yet the abstract provides no metrics, tables, ablation studies, or experimental details to support it. In addition, DeepVQE predates the URGENT Challenge; without a matched-data evaluation (e.g., retraining or re-evaluating DeepVQE on the identical URGENT dataset), gains cannot be attributed to the diffusion architecture rather than differences in training data quality and diversity.
minor comments (1)
  1. [Abstract] Abstract: The qualifier '(still non-causal)' is mentioned but not elaborated; a brief discussion of latency implications for hands-free applications would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address the concerns point by point below and will revise the manuscript to strengthen the presentation of results and clarify the experimental comparisons.

read point-by-point responses
  1. Referee: The central claim that DiffVQE 'excels DeepVQE both in echo and noise control performance, as well as in computational complexity and model size' is load-bearing for the paper's contribution, yet the abstract provides no metrics, tables, ablation studies, or experimental details to support it.

    Authors: We agree that the abstract should be more self-contained. In the revised version we will insert the key quantitative results (ERLE, PESQ, STOI deltas, parameter count, and real-time factor) that are already reported in the experimental section, so that the central claim is immediately supported by numbers. revision: yes

  2. Referee: In addition, DeepVQE predates the URGENT Challenge; without a matched-data evaluation (e.g., retraining or re-evaluating DeepVQE on the identical URGENT dataset), gains cannot be attributed to the diffusion architecture rather than differences in training data quality and diversity.

    Authors: We acknowledge the limitation. Our current comparison evaluates the publicly released DeepVQE checkpoint on the URGENT test set while DiffVQE is trained on the URGENT training partition; this guarantees identical test conditions but does not control for training-data differences. We will add an explicit statement of this protocol in the revised manuscript, qualify the attribution of gains, and note that retraining DeepVQE on the URGENT data would be a valuable future experiment. The reproducibility of DiffVQE itself on the URGENT corpus remains a distinct contribution. revision: partial

Circularity Check

0 steps flagged

No circularity in empirical performance claims

full rationale

The paper's central claims rest on training a diffusion model (DiffVQE) on the URGENT Challenge dataset and reporting empirical superiority over the prior DeepVQE model in echo/noise control, complexity, and size. No equations, first-principles derivations, fitted parameters renamed as predictions, or self-referential definitions appear in the abstract or described claims. The reproducibility statement and 'first diffusion-based AEC' positioning are factual assertions about the work, not tautological reductions. Dataset differences in the DeepVQE comparison raise validity concerns but do not create circularity by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract describes an empirical neural model at a high level and introduces no explicit free parameters, mathematical axioms, or new postulated entities.

pith-pipeline@v0.9.0 · 5482 in / 1090 out tokens · 54106 ms · 2026-05-12T01:24:07.041356+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

50 extracted references · 50 canonical work pages · 1 internal anchor

  1. [1]

    Predominantly in noise reduction tasks, generative approaches have gained significant traction

    Introduction: Speech enhancement has undergone a significant paradigm shift in recent years. Predominantly in noise reduction tasks, generative approaches have gained significant traction. Previously, many approaches utilized some form of mean squared error (MSE) loss either in time domain or in frequency domain to train discriminative mask-based d...

  2. [2]

    DiffVQE: Hybrid Diffusion Voice Quality Enhancement Under Acoustic Echo and Noise

    Methods 2.1. Data representation and framework overview: An overview of our hands-free system is given in Fig. 1. The far-end signal x(n) with sample index n is transmitted to the near-end and played back by a loudspeaker. Loudspeaker non-linearities are modeled by x′(n) = f_NL(x(n)). The microphone receives x′(n) as an echo d(n) = h1(n) ∗ x′(n), with h1(n) bein...

  3. [3]

    Experimental setup 3.1. Datasets and framework: To generate a diverse set of samples, our proposed DiffVQE is trained on a dataset comprising speech and noise sources from the Interspeech 2025 URGENT Challenge [19]. As generative methods benefit highly from high quality ground truth targets in training, we exclude the CommonVoice 19.0 [28] dataset. We furthe...

  4. [4]

    Besides the AECMOS metrics, we include Table 1: Model performance on Dval in all three conditions

    Experimental evaluation and discussion: In Table 1, we show results of our proposed DiffVQE variants as well as from the retrained DeepVQE baseline on Dval for all conditions. Besides the AECMOS metrics, we include Table 1: Model performance on Dval in all three conditions. Best performance is indicated in bold, second best is underlined. DT STFE STNE Avg. Me...

  5. [5]

    It is one of the first diffusion-based acoustic echo control (AEC) methods (still non-causal), being smaller, less complex and faster than the so-far SOTA DeepVQE

    Conclusions: In this work, we proposed a novel hybrid score-based diffusion approach to voice quality enhancement under acoustic echo and noise. It is one of the first diffusion-based acoustic echo control (AEC) methods (still non-causal), being smaller, less complex and faster than the so-far SOTA DeepVQE. Our proposed DiffVQE approaches excel DeepVQE in ...

  6. [6]

    Speech Enhancement with Score-Based Generative Models in the Complex STFT Domain,

    S. Welker, J. Richter, and T. Gerkmann, “Speech Enhancement with Score-Based Generative Models in the Complex STFT Domain,” in Proc. of Interspeech, Incheon, Korea, Sep. 2022, pp. 2928–2932

  7. [7]

    StoRM: A Diffusion-Based Stochastic Regeneration Model for Speech Enhancement and Dereverberation,

    J.-M. Lemercier, J. Richter, S. Welker, and T. Gerkmann, “StoRM: A Diffusion-Based Stochastic Regeneration Model for Speech Enhancement and Dereverberation,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 31, pp. 2724–2737, Jul. 2022

  8. [8]

    Speech Enhancement and Dereverberation With Diffusion-Based Generative Models,

    J. Richter, S. Welker, J.-M. Lemercier, B. Lay, and T. Gerkmann, “Speech Enhancement and Dereverberation With Diffusion-Based Generative Models,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 31, pp. 2351–2364, 2023

  9. [9]

    Universal Score-based Speech Enhancement with High Content Preservation,

    R. Scheibler, Y. Fujita, Y. Shirahata, and T. Komatsu, “Universal Score-based Speech Enhancement with High Content Preservation,” in Proc. of Interspeech, Kos, Greece, Sep. 2024, pp. 1165–1169

  10. [10]

    EffDiffSE: Efficient Diffusion-Based Frequency-Domain Speech Enhancement with Hybrid Discriminative and Generative DNNs,

    Y. Fu, R. Shi, M. Sach, W. Tirry, and T. Fingscheidt, “EffDiffSE: Efficient Diffusion-Based Frequency-Domain Speech Enhancement with Hybrid Discriminative and Generative DNNs,” in Proc. of WASPAA, Tahoe City, CA, USA, Oct. 2025, pp. 1–5

  11. [11]

    DeepVQE: Real Time Deep Voice Quality Enhancement for Joint Acoustic Echo Cancellation, Noise Suppression and Dereverberation,

    E. Indenbom, N.-C. Ristea, A. Saabas, T. Parnamaa, J. Guzvin, and R. Cutler, “DeepVQE: Real Time Deep Voice Quality Enhancement for Joint Acoustic Echo Cancellation, Noise Suppression and Dereverberation,” in Proc. of Interspeech, Dublin, Ireland, Aug. 2023, pp. 3819–3823

  12. [12]

    Hänsler and G

    E. Hänsler and G. Schmidt, Acoustic Echo and Noise Control: A Practical Approach. Wiley, 2004

  13. [13]

    Frequency-Domain Adaptive Kalman Filter for Acoustic Echo Control in Hands-Free Telephones,

    G. Enzner and P. Vary, “Frequency-Domain Adaptive Kalman Filter for Acoustic Echo Control in Hands-Free Telephones,” Signal Processing, vol. 86, no. 6, pp. 1140–1156, Jun. 2006

  14. [14]

    Neural Kalman Filters for Acoustic Echo Cancellation: Comparison of Deep Neural Network-Based Extensions,

    E. Seidel, G. Enzner, P. Mowlaee, and T. Fingscheidt, “Neural Kalman Filters for Acoustic Echo Cancellation: Comparison of Deep Neural Network-Based Extensions,” IEEE Signal Processing Magazine, vol. 41, no. 4, pp. 24–38, Jan. 2024

  15. [15]

    End-to-End Deep Learning-Based Adaptation Control for Linear Acoustic Echo Cancellation,

    T. Haubner, A. Brendel, and W. Kellermann, “End-to-End Deep Learning-Based Adaptation Control for Linear Acoustic Echo Cancellation,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 32, pp. 227–238, Oct. 2023

  16. [16]

    Low-Complexity Acoustic Echo Cancellation with Neural Kalman Filtering,

    D. Yang, F. Jiang, W. Wu, X. Fang, and M. Cao, “Low-Complexity Acoustic Echo Cancellation with Neural Kalman Filtering,” in Proc. of ICASSP, Rhodes Island, Greece, Jun. 2023, pp. 7846–7850

  17. [17]

    A Progressive Neural Network for Acoustic Echo Cancellation,

    Z. Chen, X. Xia, S. Sun, Z. Wang, C. Chen, and G. Xie, “A Progressive Neural Network for Acoustic Echo Cancellation,” in Proc. of ICASSP, Rhodes Island, Greece, Mar. 2023, pp. 12579–12580

  18. [18]

    Efficient High-Performance Bark-Scale Neural Network for Residual Echo and Noise Suppression,

    E. Seidel, P. Mowlaee, and T. Fingscheidt, “Efficient High-Performance Bark-Scale Neural Network for Residual Echo and Noise Suppression,” in Proc. of ICASSP, Seoul, Korea, Apr. 2024, pp. 1386–1390

  19. [19]

    A Hybrid Approach for Low-Complexity Joint Acoustic Echo and Noise Reduction,

    S. S. Shetu, N. Kumar Desiraju, J. M. Martinez Aponte, E. A. P. Habets, and E. Mabande, “A Hybrid Approach for Low-Complexity Joint Acoustic Echo and Noise Reduction,” in Proc. of IWAENC, Aalborg, Denmark, Sep. 2024, pp. 349–353

  20. [20]

    EchoFree: Towards Ultra Lightweight and Efficient Neural Acoustic Echo Cancellation,

    X. Li, B. Kang, Z. Wang, Z. Zhang, M. Liu, Z. Fu, and L. Xie, “EchoFree: Towards Ultra Lightweight and Efficient Neural Acoustic Echo Cancellation,” arXiv, no. 2508.06271, Aug. 2025

  21. [21]

    Convergence and Performance Analysis of Classical, Hybrid, and Deep Acoustic Echo Control,

    E. Seidel, P. Mowlaee, and T. Fingscheidt, “Convergence and Performance Analysis of Classical, Hybrid, and Deep Acoustic Echo Control,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 32, pp. 2857–2870, May 2024

  22. [22]

    FSD: Acoustic Echo Cancellation with Fewer Step Diffusion,

    Y. Liu, L. Wan, Y. Huang, M. Sun, C. Zhao, Z. Ni, X. Mei, Y. Shi, and F. Metze, “FSD: Acoustic Echo Cancellation with Fewer Step Diffusion,” in Proc. of NeurIPS – Workshops, Vancouver, BC, Canada, Dec. 2024, pp. 1–6

  23. [23]

    URGENT Challenge: Universality, Robustness, and Generalizability for Speech Enhancement,

    W. Zhang, R. Scheibler, K. Saijo, S. Cornell, C. Li, Z. Ni, J. Pirklbauer, M. Sach, S. Watanabe, T. Fingscheidt, and Y. Qian, “URGENT Challenge: Universality, Robustness, and Generalizability for Speech Enhancement,” in Proc. of Interspeech, Kos, Greece, Sep. 2024, pp. 4868–4872

  24. [24]

    Interspeech 2025 URGENT Speech Enhancement Challenge,

    K. Saijo, W. Zhang, S. Cornell, R. Scheibler, C. Li, Z. Ni, A. Kumar, M. Sach, Y. Fu, W. Wang, T. Fingscheidt, and S. Watanabe, “Interspeech 2025 URGENT Speech Enhancement Challenge,” in Proc. of Interspeech, Rotterdam, Netherlands, Aug. 2025, pp. 858–862

  25. [25]

    DiffVQE Supplement,

    H. Lugo Girao, E. Seidel, P. Mowlaee, Z. Zhao, and T. Fingscheidt, “DiffVQE Supplement,” https://ifnspaml.github.io/DiffVQE-Demo/, 2026

  26. [26]

    Score-Based Generative Modeling through Stochastic Differential Equations,

    Y. Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, and B. Poole, “Score-Based Generative Modeling through Stochastic Differential Equations,” in Proc. of ICLR, Virtual Event, Austria, May 2021, pp. 1–36

  27. [27]

    Reverse-Time Diffusion Equation Models,

    B. D. O. Anderson, “Reverse-Time Diffusion Equation Models,” Stochastic Processes and their Applications, vol. 12, no. 3, pp. 313–326, May 1982

  28. [28]

    A Connection Between Score Matching and Denoising Autoencoders,

    P. Vincent, “A Connection Between Score Matching and Denoising Autoencoders,” Neural Computation, vol. 23, no. 7, pp. 1661–1674, Jul. 2011

  29. [29]

    Adversarial Score Matching and Improved Sampling for Image Generation,

    A. Jolicoeur-Martineau, R. Piché-Taillefer, I. Mitliagkas, and R. T. des Combes, “Adversarial Score Matching and Improved Sampling for Image Generation,” in Proc. of ICLR, May 2021, pp. 1–9

  30. [30]

    A Consolidated View of Loss Functions for Supervised Deep Learning-Based Speech Enhancement,

    S. Braun and I. Tashev, “A Consolidated View of Loss Functions for Supervised Deep Learning-Based Speech Enhancement,” in Proc. of Conference on Telecommunications and Signal Processing (TSP), Brno, Czech Republic, Jul. 2021, pp. 72–76

  31. [31]

    Elucidating the Design Space of Diffusion-Based Generative Models,

    T. Karras, M. Aittala, T. Aila, and S. Laine, “Elucidating the Design Space of Diffusion-Based Generative Models,” in Proc. of NeurIPS, New Orleans, LA, USA, Dec. 2022, pp. 1–13

  32. [32]

    Real-Time Single Image and Video Super-Resolution Using an Efficient Sub-Pixel Convolutional Neural Network,

    W. Shi, J. Caballero, F. Huszár, J. Totz, A. P. Aitken, R. Bishop, D. Rueckert, and Z. Wang, “Real-Time Single Image and Video Super-Resolution Using an Efficient Sub-Pixel Convolutional Neural Network,” in Proc. of CVPR, Las Vegas, NV, USA, Jun. 2016, pp. 1874–1883

  33. [33]

    Common Voice: A Massively-Multilingual Speech Corpus,

    R. Ardila, M. Branson, K. Davis, M. Kohler, J. Meyer, M. Henretty, R. Morais, L. Saunders, F. Tyers, and G. Weber, “Common Voice: A Massively-Multilingual Speech Corpus,” in Proc. of LREC, Marseille, France, May 2020, pp. 4218–4222

  34. [34]

    Less is More: Data Curation Matters in Scaling Speech Enhancement,

    C. Li, W. Zhang, W. Wang, R. Scheibler, K. Saijo, S. Cornell, Y. Fu, M. Sach, Z. Ni, A. Kumar, T. Fingscheidt, S. Watanabe, and Y. Qian, “Less is More: Data Curation Matters in Scaling Speech Enhancement,” in Proc. of ASRU, Honolulu, HI, USA, Dec. 2025, pp. 1–8

  35. [35]

    DNSMOS: A Non-Intrusive Perceptual Objective Speech Quality Metric to Evaluate Noise Suppressors,

    C. K. A. Reddy, V. Gopal, and R. Cutler, “DNSMOS: A Non-Intrusive Perceptual Objective Speech Quality Metric to Evaluate Noise Suppressors,” in Proc. of ICASSP, Toronto, ON, Canada, Jun. 2021, pp. 6493–6497

  36. [36]

    ICASSP 2024 Speech Signal Improvement Challenge,

    N. C. Ristea, A. Saabas, R. Cutler, B. Naderi, S. Braun, and S. Branets, “ICASSP 2024 Speech Signal Improvement Challenge,” IEEE Open Journal of Signal Processing, vol. 6, pp. 238–246, Jan. 2025

  37. [37]

    UTMOS: UTokyo-SaruLab System for VoiceMOS Challenge 2022,

    T. Saeki, D. Xin, W. Nakata, T. Koriyama, S. Takamichi, and H. Saruwatari, “UTMOS: UTokyo-SaruLab System for VoiceMOS Challenge 2022,” in Proc. of Interspeech, Incheon, Korea, Sep. 2022, pp. 4521–4525

  38. [38]

    NISQA: A Deep CNN-Self-Attention Model for Multidimensional Speech Quality Prediction with Crowdsourced Datasets,

    G. Mittag, B. Naderi, A. Chehadi, and S. Möller, “NISQA: A Deep CNN-Self-Attention Model for Multidimensional Speech Quality Prediction with Crowdsourced Datasets,” in Proc. of Interspeech, Brno, Czech Republic, Aug. 2021, pp. 2127–2131

  39. [39]

    TorchAudio-Squim: Reference-less Speech Quality and Intelligibility Measures in TorchAudio,

    A. Kumar, K. Tan, Z. Ni, P. Manocha, X. Zhang, E. Henderson, and B. Xu, “TorchAudio-Squim: Reference-less Speech Quality and Intelligibility Measures in TorchAudio,” in Proc. of ICASSP, Rhodes Island, Greece, May 2023, pp. 1–5

  40. [40]

    Pyroomacoustics: A Python Package for Audio Room Simulations and Array Processing Algorithms,

    R. Scheibler, E. Bezzam, and I. Dokmanic, “Pyroomacoustics: A Python Package for Audio Room Simulations and Array Processing Algorithms,” in Proc. of ICASSP, Calgary, AB, Canada, Apr. 2018, pp. 1–5

  41. [41]

    ICASSP 2023 Acoustic Echo Cancellation Challenge,

    R. Cutler, A. Saabas, T. Parnamaa, M. Purin, E. Indenbom, N.-C. Ristea, J. Gužvin, H. Gamper, S. Braun, and R. Aichner, “ICASSP 2023 Acoustic Echo Cancellation Challenge,” arXiv, Sep. 2023

  42. [42]

    TIMIT Acoustic-Phonetic Continuous Speech Corpus,

    J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, D. S. Pallett, N. L. Dahlgren, and V. Zue, “TIMIT Acoustic-Phonetic Continuous Speech Corpus,” Linguistic Data Consortium, Philadelphia, PA, USA, 1993

  43. [43]

    Speech Quality Performance in the Presence of Background Noise,

    ETSI, Speech Processing, Transmission and Quality Aspects (STQ); Speech Quality Performance in the Presence of Background Noise; Part 1: Background Noise Simulation Technique and Background Noise Database, European Telecommunications Standards Institute, Sep. 2008, Tech. Rep. ETSI EG 202 396-1

  44. [44]

    A Binaural Room Impulse Response Database for the Evaluation of Dereverberation Algorithms,

    M. Jeub, M. Schäfer, and P. Vary, “A Binaural Room Impulse Response Database for the Evaluation of Dereverberation Algorithms,” in Proc. of Int. Conf. on Digital Signal Processing, Santorini-Hellas, Greece, Jul. 2009, pp. 1–5

  45. [45]

    The Generalized Correlation Method for Estimation of Time Delay,

    C. Knapp and G. Carter, “The Generalized Correlation Method for Estimation of Time Delay,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 24, no. 4, pp. 320–327, Jan. 2003

  46. [46]

    AECMOS: A Speech Quality Assessment Metric for Echo Impairment,

    M. Purin, S. Sootla, M. Sponza, A. Saabas, and R. Cutler, “AECMOS: A Speech Quality Assessment Metric for Echo Impairment,” in Proc. of ICASSP, Singapore, Singapore, May 2022, pp. 901–905

  47. [47]

    P.862: Perceptual Evaluation of Speech Quality (PESQ),

    ITU, Rec. P.862: Perceptual Evaluation of Speech Quality (PESQ), International Telecommunication Union, Telecommunication Standardization Sector (ITU-T), Feb. 2001

  48. [48]

    Evaluation Metrics for Generative Speech Enhancement Methods: Issues and Perspectives,

    J. Pirklbauer, M. Sach, K. Fluyt, W. Tirry, W. Wardah, S. Möller, and T. Fingscheidt, “Evaluation Metrics for Generative Speech Enhancement Methods: Issues and Perspectives,” in Proc. of 15th ITG Conference on Speech Communication, Aachen, Germany, Sep. 2023, pp. 265–269

  49. [49]

    An Algorithm for Predicting the Intelligibility of Speech Masked by Modulated Noise Maskers,

    J. Jensen and C. H. Taal, “An Algorithm for Predicting the Intelligibility of Speech Masked by Modulated Noise Maskers,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 11, pp. 2009–2022, 2016

  50. [50]

    P.808 Multilingual Speech Enhancement Testing: Approach and Results of URGENT 2025 Challenge,

    M. Sach, Y. Fu, K. Saijo, W. Zhang, S. Cornell, R. Scheibler, C. Li, A. Kumar, W. Wang, Y. Qian, S. Watanabe, and T. Fingscheidt, “P.808 Multilingual Speech Enhancement Testing: Approach and Results of URGENT 2025 Challenge,” arXiv, no. 2507.11306, Jul. 2025