
arxiv: 2604.12480 · v1 · submitted 2026-04-14 · 💻 cs.SD · cs.AI

Audio Source Separation in Reverberant Environments using β-divergence based Nonnegative Factorization

Pith reviewed 2026-05-10 14:36 UTC · model grok-4.3

classification 💻 cs.SD cs.AI
keywords audio source separation · nonnegative tensor factorization · β-divergence · reverberant environments · spectral basis matrices · multichannel Wiener filtering · sparsity control · Gaussian model

The pith

Nonnegative tensor factorization with β-divergence and pre-trained spectral bases estimates source variances more effectively for multichannel separation in reverberant rooms.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that in Gaussian model-based multichannel audio source separation, parameters such as source spectral variances and spatial covariance matrices can be estimated more accurately by applying nonnegative tensor factorization to power spectra, guided by spectral basis matrices drawn from a redundant pre-trained library. The factorization minimizes the β-divergence using multiplicative updates, with the value of β controlling the sparsity of the solution. Experiments across multiple mixing conditions demonstrate that this yields higher separation quality via subsequent multichannel Wiener filtering than comparable methods, and that the resulting sparsity matters more for performance than the particular β chosen during training.

Core claim

In Gaussian model-based multichannel audio source separation, the likelihood of the observed mixtures is parametrized by source spectral variances and spatial covariance matrices, which are conventionally estimated by maximizing the likelihood with an Expectation-Maximization algorithm. The paper shows that these parameters can instead be obtained by nonnegative tensor factorization that minimizes the β-divergence between the observed power spectra and reconstructions formed from spectral basis matrices extracted or detected from a redundant pre-trained library. The signals are then separated by multichannel Wiener filtering.
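The final separation step named in the core claim can be sketched for a single time-frequency bin. This is a minimal illustration under the usual local Gaussian model, where source n has covariance v_n R_n and the mixture covariance is the sum over sources; the function name `multichannel_wiener` and the array layout are assumptions made here for illustration, not the paper's implementation.

```python
import numpy as np

def multichannel_wiener(x, v, R):
    """Wiener estimate of each source's spatial image at one time-frequency bin.

    x : (I,) complex mixture STFT coefficients over I channels
    v : (N,) nonnegative source spectral variances
    R : (N, I, I) Hermitian positive-definite spatial covariance matrices
    Returns an (N, I) array with one estimated image per source.
    """
    v = np.asarray(v, dtype=float)
    R = np.asarray(R)
    # Per-source covariance v_n * R_n, shape (N, I, I).
    source_cov = v[:, None, None] * R
    # Mixture covariance is the sum of the source covariances.
    mix_cov = source_cov.sum(axis=0)
    # Wiener gain for each source: (v_n R_n) @ inv(sum_m v_m R_m).
    gains = source_cov @ np.linalg.inv(mix_cov)
    # Apply each gain matrix to the mixture vector.
    return gains @ np.asarray(x)
```

Because the gains sum to the identity matrix, the estimated source images add up to the mixture, which gives a quick sanity check on any variance and covariance estimates fed into the filter.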

What carries the argument

β-divergence minimization in nonnegative tensor factorization that extracts or detects spectral basis matrices from a pre-trained redundant library to represent source power spectra.
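A minimal sketch of β-divergence minimization by multiplicative updates, shown for the plain matrix case V ≈ WH rather than the paper's tensor factorization. It assumes the common Févotte-Idier form of the β-divergence and the standard heuristic update rules; the paper's exact parametrization, update order, and normalization may differ, and the names `beta_divergence` and `beta_nmf` are illustrative only.

```python
import numpy as np

def beta_divergence(V, V_hat, beta):
    """Sum of element-wise beta-divergences d_beta(V | V_hat) for positive inputs."""
    V = np.asarray(V, dtype=float)
    V_hat = np.asarray(V_hat, dtype=float)
    if beta == 0:  # Itakura-Saito limit
        ratio = V / V_hat
        return float(np.sum(ratio - np.log(ratio) - 1.0))
    if beta == 1:  # generalized Kullback-Leibler limit
        return float(np.sum(V * np.log(V / V_hat) - V + V_hat))
    num = V**beta + (beta - 1.0) * V_hat**beta - beta * V * V_hat**(beta - 1.0)
    return float(np.sum(num) / (beta * (beta - 1.0)))

def beta_nmf(V, rank, beta, n_iter=100, seed=0, eps=1e-12):
    """Factor a positive matrix V (F x T) as W (F x rank) @ H (rank x T)."""
    rng = np.random.default_rng(seed)
    F, T = V.shape
    W = rng.random((F, rank)) + eps
    H = rng.random((rank, T)) + eps
    history = []
    for _ in range(n_iter):
        # Multiplicative update of the activations H.
        V_hat = W @ H + eps
        H *= (W.T @ (V_hat**(beta - 2.0) * V)) / (W.T @ V_hat**(beta - 1.0) + eps)
        # Multiplicative update of the spectral bases W.
        V_hat = W @ H + eps
        W *= ((V_hat**(beta - 2.0) * V) @ H.T) / (V_hat**(beta - 1.0) @ H.T + eps)
        history.append(beta_divergence(V, W @ H + eps, beta))
    return W, H, history
```

Both factors stay nonnegative because each update multiplies the current value by a ratio of nonnegative terms. The recorded `history` makes it easy to check that the divergence falls over the iterations for a chosen β.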

If this is right

  • The same library-based factorization step can be used either to extract new bases or to detect the best-matching ones from the library for any given mixture.
  • Sparsity induced by the choice of β improves separation performance regardless of the β value used when the library itself was trained.
  • The resulting parameter estimates feed directly into the multichannel Wiener filter to produce the separated signals.
  • The approach applies across a range of reverberant mixing conditions and outperforms other comparable algorithms on separation quality.
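The detection idea in the first bullet can be illustrated with a simplified stand-in: hold a concatenated library of basis matrices fixed, fit only the activations by multiplicative β-divergence updates, and pick the library entry that explains most of the fitted model. This is a sketch under assumptions, not the paper's detection algorithm; the scoring rule, the function name `detect_library_block`, and the single-channel setting are all choices made here for illustration.

```python
import numpy as np

def detect_library_block(V, library, beta, n_iter=200, seed=0, eps=1e-12):
    """Return the index of the library basis matrix that best explains V.

    V       : (F, T) positive power spectrogram
    library : list of (F, K_k) nonnegative basis matrices
    beta    : beta-divergence parameter used for the activation updates
    """
    W = np.concatenate(library, axis=1)          # fixed, redundant dictionary
    sizes = [U.shape[1] for U in library]
    edges = np.cumsum([0] + sizes)               # column range of each entry
    rng = np.random.default_rng(seed)
    H = rng.random((W.shape[1], V.shape[1])) + eps
    for _ in range(n_iter):
        # Only the activations are updated; the library stays untouched.
        V_hat = W @ H + eps
        H *= (W.T @ (V_hat**(beta - 2.0) * V)) / (W.T @ V_hat**(beta - 1.0) + eps)
    # Score each entry by the total power its bases contribute to the model.
    scores = np.array([
        np.sum(W[:, a:b] @ H[a:b]) for a, b in zip(edges[:-1], edges[1:])
    ])
    return int(np.argmax(scores)), scores / scores.sum()
```

With a spectrogram generated from one library entry, the normalized scores should concentrate on that entry, which is the behaviour the detection step relies on.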

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The library of spectral bases could be updated incrementally from new recordings without retraining the entire factorization pipeline from scratch.
  • Because sparsity dominates over the exact β value, similar performance gains may appear in other audio tasks that rely on nonnegative factorization with adjustable divergence parameters.
  • The separation quality improvement suggests the method could reduce the amount of training data needed for the library when only a few source classes are expected in the mixtures.

Load-bearing premise

That spectral basis matrices extracted or detected from a redundant pre-trained library accurately represent the power spectra of the actual source signals present in the observed reverberant mixtures, allowing the β-divergence minimization to yield superior parameter estimates.

What would settle it

Run the identical separation pipeline on the same reverberant mixtures with and without access to the pre-trained library, and compare separation metrics such as signal-to-distortion ratio (SDR). If the metrics do not change, the library is not doing the work the paper attributes to it.
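A hedged sketch of how such a with/without comparison could be scored. It uses a simplified energy-ratio SDR, not the BSS Eval decomposition typically reported in separation papers, and the function names `sdr_db` and `ablation_gain` are hypothetical.

```python
import numpy as np

def sdr_db(reference, estimate):
    """Simplified SDR in dB: reference energy over residual energy."""
    reference = np.asarray(reference, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    residual = reference - estimate
    return 10.0 * np.log10(np.sum(reference**2) / np.sum(residual**2))

def ablation_gain(reference, est_with_library, est_without_library):
    """SDR difference in dB between the run with the library and the run without."""
    return sdr_db(reference, est_with_library) - sdr_db(reference, est_without_library)
```

A gain close to 0 dB across mixtures would indicate that the library adds nothing; a consistently positive gain would support the load-bearing premise.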

Figures

Figures reproduced from arXiv: 2604.12480 by Mahmoud Fakhry, Maurizio Omologo, Piergiorgio Svaizer.

Figure 1: Distance measured by different definitions of divergence, with
Figure 2: Examples of controlling the sparsity of Wn of the corrupted power spectrum by selecting the value of β, from the left to the right, respectively, original Wn (training) with β = 0.9, estimated with β = 0.1, estimated with β = 0.3, estimated with β = 0.6, and estimated with β = 0.9
Figure 3: Normalized power spectra of true, corrupted, and reconstructed signals, from the left to the right respectively.
Figure 4: Flowchart of the proposed method θn = {wn(l), Rn(ω)}
Figure 5: Normalized Likelihoods (d) of each spectral basis vector in the library Ulib as a function of the separation iterations. First column on the left for the observed mixtures, then the five columns for iterations (1, 10, 25, 40 and 65 respectively) and rows for 3 speech source signals
Figure 6: Normalized Likelihoods of each spectral basis matrix
Figure 7: Iterative improvement of SDR for four different detection
Figure 8: The average separation performance of the informed case
Original abstract

In Gaussian model-based multichannel audio source separation, the likelihood of observed mixtures of source signals is parametrized by source spectral variances and by associated spatial covariance matrices. These parameters are estimated by maximizing the likelihood through an Expectation-Maximization algorithm and used to separate the signals by means of multichannel Wiener filtering. We propose to estimate these parameters by applying nonnegative factorization based on prior information on source variances. In the nonnegative factorization, spectral basis matrices can be defined as the prior information. The matrices can be either extracted or indirectly made available through a redundant library that is trained in advance. In a separate step, applying nonnegative tensor factorization, two algorithms are proposed in order to either extract or detect the basis matrices that best represent the power spectra of the source signals in the observed mixtures. The factorization is achieved by minimizing the $\beta$-divergence through multiplicative update rules. The sparsity of factorization can be controlled by tuning the value of $\beta$. Experiments show that sparsity, rather than the value assigned to $\beta$ in the training, is crucial in order to increase the separation performance. The proposed method was evaluated in several mixing conditions. It provides better separation quality with respect to other comparable algorithms.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes estimating source spectral variances in a Gaussian model-based multichannel audio source separation framework for reverberant environments by applying β-divergence nonnegative matrix factorization (NMF) and nonnegative tensor factorization. Spectral basis matrices serving as priors are either extracted or detected from a redundant pre-trained library via multiplicative update rules that minimize the β-divergence; sparsity is controlled by the choice of β. The separated signals are obtained via multichannel Wiener filtering. Experiments across mixing conditions are claimed to show that sparsity (rather than the specific β value) is the key factor improving separation quality relative to comparable algorithms.

Significance. If the empirical superiority holds under detailed scrutiny, the work would offer a practical route to injecting spectral priors into reverberant source separation via β-NMF, with the sparsity insight providing a useful tuning guideline. The integration of pre-trained libraries with tensor factorization steps could extend existing NMF-based separation pipelines.

major comments (2)
  1. Abstract: the central claim that the method 'provides better separation quality with respect to other comparable algorithms' is load-bearing for the paper's contribution, yet the abstract supplies no quantitative metrics (e.g., SDR, SIR, SAR values), no statistical tests, no description of baselines, data sets, or error analysis. This leaves the performance advantage only weakly supported.
  2. Method description (nonnegative factorization and basis extraction steps): the approach presupposes that spectral basis matrices extracted or detected from the pre-trained library faithfully represent the power spectra of the actual sources inside the observed reverberant mixtures. Reverberation convolves sources with room impulse responses and thereby alters observed spectra; the manuscript does not indicate whether the library was trained on anechoic or reverberant data, nor whether the subsequent β-divergence minimization compensates for any mismatch. If the bases are mismatched, the reported gains may be artifacts of the tested conditions rather than a general advantage of the β-NMF formulation.
minor comments (2)
  1. Abstract: the phrasing 'two algorithms are proposed in order to either extract or detect the basis matrices' is slightly ambiguous; a clearer distinction between the extraction and detection procedures would aid readability.
  2. Notation: ensure that the distinction between the β-divergence used in training versus in the factorization step is consistently denoted throughout the text and equations.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments on our manuscript. We address each major comment below and indicate the corresponding revisions.

Point-by-point responses
  1. Referee: Abstract: the central claim that the method 'provides better separation quality with respect to other comparable algorithms' is load-bearing for the paper's contribution, yet the abstract supplies no quantitative metrics (e.g., SDR, SIR, SAR values), no statistical tests, no description of baselines, data sets, or error analysis. This leaves the performance advantage only weakly supported.

    Authors: We agree that the abstract would benefit from more concrete support for the performance claims. In the revised manuscript we will expand the abstract to include representative quantitative results (e.g., average SDR gains relative to the compared algorithms), a brief statement of the evaluation metrics, and a high-level description of the datasets and mixing conditions used. This will strengthen the central claim while remaining within abstract length limits.
    revision: yes

  2. Referee: Method description (nonnegative factorization and basis extraction steps): the approach presupposes that spectral basis matrices extracted or detected from the pre-trained library faithfully represent the power spectra of the actual sources inside the observed reverberant mixtures. Reverberation convolves sources with room impulse responses and thereby alters observed spectra; the manuscript does not indicate whether the library was trained on anechoic or reverberant data, nor whether the subsequent β-divergence minimization compensates for any mismatch. If the bases are mismatched, the reported gains may be artifacts of the tested conditions rather than a general advantage of the β-NMF formulation.

    Authors: The library supplies general spectral templates for the source classes; the β-NMF step is performed on power spectra estimated from the reverberant mixtures themselves, so that the factorization selects and weights the library bases to best fit the observed (reverberant) data. This adaptation step is intended to mitigate spectral mismatch caused by room impulse responses. We will revise the method section to state explicitly that the library is trained on anechoic recordings and to clarify how the β-divergence minimization on the mixture spectra provides the necessary adaptation. The consistent improvements observed across multiple reverberant conditions support the practical utility of this procedure.
    revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

Full rationale

The paper's derivation applies standard β-divergence nonnegative matrix and tensor factorization using externally pre-trained spectral basis libraries to estimate source variances and spatial covariances for multichannel Wiener filtering. Parameter estimation proceeds via multiplicative updates, with sparsity controlled by β, and performance is assessed empirically on held-out reverberant mixtures across multiple conditions. No load-bearing step reduces by construction to a fitted input, self-citation chain, or renamed ansatz; the separation quality gains are grounded in independent experimental comparisons rather than tautological re-expression of the inputs.

Axiom & Free-Parameter Ledger

1 free parameters · 2 axioms · 0 invented entities

The central claim rests on the availability of a pre-trained redundant library of spectral bases and on the assumption that β-divergence minimization with tunable sparsity produces parameter estimates superior to EM; β itself functions as a tunable hyperparameter rather than a fitted constant.

free parameters (1)
  • β
    Controls the sparsity of the factorization; its value is tuned rather than derived, and experiments indicate sparsity level matters more than the specific training value of β.
axioms (2)
  • domain assumption: Source power spectra in reverberant mixtures can be adequately represented by nonnegative linear combinations of basis matrices drawn from a pre-trained redundant library.
    Invoked when defining prior information for the nonnegative factorization step.
  • domain assumption: Minimizing β-divergence via multiplicative updates yields parameter estimates that improve downstream multichannel Wiener filtering.
    Core modeling choice linking factorization to separation quality.
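
For reference on the free parameter β in this ledger, one widely used parametrization of the β-divergence between positive scalars x and y is written out here. It follows the Févotte-Idier convention and is stated as an assumption; the paper's own definition may be scaled or indexed differently.

```latex
d_\beta(x \mid y) =
\begin{cases}
\dfrac{x^{\beta} + (\beta - 1)\, y^{\beta} - \beta\, x\, y^{\beta - 1}}{\beta(\beta - 1)},
  & \beta \in \mathbb{R} \setminus \{0, 1\}, \\[2ex]
x \log \dfrac{x}{y} - x + y,
  & \beta = 1 \quad \text{(generalized Kullback-Leibler)}, \\[2ex]
\dfrac{x}{y} - \log \dfrac{x}{y} - 1,
  & \beta = 0 \quad \text{(Itakura-Saito)}.
\end{cases}
```

Under this convention, β = 2 gives half the squared Euclidean distance, (x - y)^2 / 2, and the β = 0 and β = 1 cases are the limits of the general expression.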

pith-pipeline@v0.9.0 · 5523 in / 1403 out tokens · 30518 ms · 2026-05-10T14:36:40.811715+00:00 · methodology

