
arxiv: 1907.06859 · v1 · pith:DOUA4YHAnew · submitted 2019-07-16 · 📡 eess.AS · cs.SD

Towards Adapting NMF Dictionaries Using Total Variability Modeling for Noise-Robust Acoustic Features

Pith reviewed 2026-05-24 20:51 UTC · model grok-4.3

classification 📡 eess.AS cs.SD
keywords noise-robust acoustic features · NMF dictionary adaptation · total variability modeling · speech recognition · unseen noise · utterance-specific transform · acoustic feature extraction

The pith

Total variability modeling adapts NMF dictionaries per utterance to produce noise-robust acoustic features without any parallel clean-noisy training pairs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a method that learns a total variability subspace from NMF representations and uses it to create an utterance-specific transform for adapting dictionaries. This produces acoustic features that remain robust to the noise present in each individual utterance. The approach sidesteps the common requirement for paired clean and noisy speech data during training. On the Aurora 4 plus DEMAND noise corpus the resulting features match baseline performance overall and stay closest to clean-speech word error rates when the noise is unseen.

Core claim

A total variability subspace learned without parallel clean-noisy pairs can be combined with NMF to generate utterance-specific dictionary adaptations that yield acoustic features whose word error rates on noisy test data remain comparable to standard baselines and, on unseen noises, closest to the clean-speech baseline.

What carries the argument

Total variability subspace that produces an utterance-specific transform for adapting NMF dictionaries
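
The abstract does not give the adaptation equations, so the following is only a sketch of one way such a transform could look. It borrows the standard total variability form (adapted supervector = global mean + T·w) and applies it to a vectorized dictionary. The names `W_ubm`, `T`, `w`, `adapt_dictionary`, and `activations`, the clipping at zero, and all sizes are assumptions made for illustration, not the authors' formulation.

```python
import numpy as np

def adapt_dictionary(W_ubm, T, w):
    """Hypothetical adaptation: vec(W_utt) = vec(W_ubm) + T @ w, clipped at zero.

    W_ubm : (d, k) global non-negative dictionary (assumed given)
    T     : (d*k, r) low-rank subspace matrix (assumed already learned)
    w     : (r,) utterance-specific coordinates in that subspace
    """
    d, k = W_ubm.shape
    offset = (T @ w).reshape(d, k)
    # Clipping keeps the dictionary non-negative; this is an assumption.
    return np.maximum(W_ubm + offset, 0.0)

def activations(V, W, n_iter=200, eps=1e-10, seed=0):
    """Non-negative H for a fixed dictionary W, via multiplicative updates."""
    rng = np.random.default_rng(seed)
    H = rng.random((W.shape[1], V.shape[1])) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
    return H

# Toy sizes and random data, purely illustrative.
rng = np.random.default_rng(0)
d, k, r, t = 16, 5, 3, 40
W_ubm = rng.random((d, k))
T = 0.1 * rng.standard_normal((d * k, r))
w = rng.standard_normal(r)
V = rng.random((d, t))          # stand-in for a magnitude spectrogram

W_utt = adapt_dictionary(W_ubm, T, w)
features = activations(V, W_utt)  # activations used as the acoustic features
```

With `w` set to zero the sketch returns the global dictionary unchanged, which is the behaviour one would expect from any subspace-offset model of this kind.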

If this is right

  • Noise-robust features become feasible without collecting or aligning clean-noisy parallel corpora.
  • Each utterance receives its own adaptation rather than a single global model.
  • Performance on unseen noise conditions improves relative to fixed dictionary approaches.
  • The same pipeline remains competitive with convolutive NMF features on seen noise conditions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The method could lower the data-collection burden for building robust speech recognizers in new acoustic environments.
  • Utterance-level adaptation might transfer to other signal-processing tasks where per-example dictionary or basis adjustment is useful.
  • Because the transform is computed from the test utterance itself, the approach may suit streaming or low-latency applications once the subspace is fixed.

Load-bearing premise

A subspace learned without paired clean-noisy examples can still generate transforms that meaningfully reduce the effect of noise on the extracted features.

What would settle it

The claim would be refuted if, on the Aurora 4 + DEMAND corpus with held-out noise types, the proposed features produced word error rates farther from the clean-speech rate than the CNMF or other baseline features do.
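
The test above amounts to ranking feature types by the gap between their unseen-noise WER and their clean-speech WER. A minimal sketch follows, assuming each feature type has its own clean-speech WER. The function name, the feature names, and every number are invented placeholders, not results from the paper.

```python
def rank_by_gap_to_clean(wer):
    """Sort feature types by |unseen-noise WER - clean WER|, smallest gap first.

    wer maps a feature name to a (clean_wer, unseen_noise_wer) pair in percent.
    """
    gaps = {name: abs(noisy - clean) for name, (clean, noisy) in wer.items()}
    return sorted(gaps.items(), key=lambda item: item[1])

# Placeholder values only; substitute measured WERs.
placeholder_wer = {
    "feat_a": (10.0, 14.0),
    "feat_b": (9.0, 16.5),
    "feat_c": (9.5, 15.0),
}

ranking = rank_by_gap_to_clean(placeholder_wer)
closest = ranking[0][0]
```

If the proposed features are not first in this ranking on held-out noise types, the claim does not hold for that evaluation.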

Figures

Figures reproduced from arXiv: 1907.06859 by Colin Vaz, Kunal Dhawan, Ruchir Travadi, Shrikanth Narayanan.

Figure 1: Visualizing the dictionary W and activation matrix H after running NMF on a speech signal V. One can think of the dictionary as containing k components that are added together by the activation matrix to approximate the input matrix. In the case of speech, the input matrix is typically the magnitude spectrogram, and the dictionary contains spectral “building blocks” required to reconstruct the spectrogram.…
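
The decomposition described in the caption can be sketched with the multiplicative updates of Lee and Seung for the squared-error objective. This is a generic NMF illustration on a synthetic matrix, not the paper's pipeline; the function name `nmf`, the iteration count, and the toy dimensions are assumptions.

```python
import numpy as np

def nmf(V, k, n_iter=200, eps=1e-10, seed=0):
    """Approximate a non-negative V (d x t) by W (d x k) @ H (k x t)."""
    rng = np.random.default_rng(seed)
    d, t = V.shape
    W = rng.random((d, k)) + eps   # dictionary: k spectral building blocks
    H = rng.random((k, t)) + eps   # activations: how much of each block per frame
    for _ in range(n_iter):
        # Multiplicative updates keep W and H non-negative.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Synthetic stand-in for a magnitude spectrogram (20 bins, 50 frames, rank 4).
rng = np.random.default_rng(1)
V = rng.random((20, 4)) @ rng.random((4, 50))

W, H = nmf(V, k=4)
rel_err = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
```

Each column of `W @ H` is a non-negative weighted sum of the columns of `W`, which is the purely additive reconstruction the caption refers to.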
Original abstract

We propose an algorithm to extract noise-robust acoustic features from noisy speech. We use Total Variability Modeling in combination with Non-negative Matrix Factorization (NMF) to learn a total variability subspace and adapt NMF dictionaries for each utterance. Unlike several other approaches for extracting noise-robust features, our algorithm does not require a training corpus of parallel clean and noisy speech. Furthermore, the proposed features are produced by an utterance-specific transform, allowing the features to be robust to the noise occurring in each utterance. Preliminary results on the Aurora 4 + DEMAND noise corpus show that our proposed features perform comparably to baseline acoustic features, including features calculated from a convolutive NMF (CNMF) model. Moreover, on unseen noises, our proposed features gives the most similar word error rate to clean speech compared to the baseline features.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes an algorithm combining Total Variability Modeling with Non-negative Matrix Factorization (NMF) to learn a total variability subspace and produce utterance-specific NMF dictionary adaptations for extracting noise-robust acoustic features from noisy speech. Unlike prior methods, it requires no parallel clean-noisy training corpus. Preliminary experiments on the Aurora 4 corpus mixed with DEMAND noises report that the resulting features yield word error rates (WER) comparable to standard and convolutive-NMF baselines, and the closest match to clean-speech WER when tested on unseen noises.

Significance. If the central modeling step can be shown to work, the result would be useful because it removes the parallel-data requirement that limits many noise-robust feature extractors and replaces it with an utterance-specific transform that can in principle adapt to the noise present in each individual recording. The reported metric (WER proximity to clean) directly tests the intended robustness outcome.

major comments (2)
  1. [Abstract] The central claim that a total-variability subspace learned without parallel clean-noisy pairs can produce a useful utterance-specific transform is stated but never accompanied by the estimation procedure, the adaptation equations, or any derivation showing how the subspace is applied to an NMF dictionary. This information is load-bearing for both the “no parallel data” advantage and the reported WER results.
  2. [Abstract] Results are labeled “preliminary” with no mention of the number of utterances, cross-validation folds, statistical significance tests, or error bars on the WER figures. Without these, it is impossible to evaluate whether the claim that the proposed features give “the most similar word error rate to clean speech” on unseen noises is reliable.
minor comments (1)
  1. [Abstract] The sentence “our proposed features gives the most similar word error rate” contains a subject-verb agreement error.
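
The error bars requested in major comment 2 could be obtained, for example, with a percentile bootstrap over utterances. The sketch below is one common recipe and not something the paper reports; the function name `bootstrap_wer_ci`, the resample count, and the synthetic error counts are all assumptions.

```python
import numpy as np

def bootstrap_wer_ci(errors, ref_words, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for corpus-level WER.

    errors    : per-utterance count of substitutions + deletions + insertions
    ref_words : per-utterance number of reference words
    """
    errors = np.asarray(errors, dtype=float)
    ref_words = np.asarray(ref_words, dtype=float)
    rng = np.random.default_rng(seed)
    n = len(errors)
    # Resample utterances with replacement, n_boot times.
    idx = rng.integers(0, n, size=(n_boot, n))
    wers = errors[idx].sum(axis=1) / ref_words[idx].sum(axis=1)
    lo, hi = np.quantile(wers, [alpha / 2, 1 - alpha / 2])
    point = errors.sum() / ref_words.sum()
    return point, lo, hi

# Synthetic per-utterance counts, for illustration only.
rng = np.random.default_rng(1)
ref_words = rng.integers(5, 25, size=300)
errors = rng.binomial(ref_words, 0.12)

point, lo, hi = bootstrap_wer_ci(errors, ref_words)
```

Resampling whole utterances rather than individual words respects the fact that errors within an utterance are correlated. Overlapping intervals for two feature types would suggest that a "closest to clean" ranking is not yet reliable.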

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the major comments point by point below.

  1. Referee: [Abstract] The central claim that a total-variability subspace learned without parallel clean-noisy pairs can produce a useful utterance-specific transform is stated but never accompanied by the estimation procedure, the adaptation equations, or any derivation showing how the subspace is applied to an NMF dictionary. This information is load-bearing for both the “no parallel data” advantage and the reported WER results.

    Authors: The abstract is a concise summary. The estimation procedure for learning the total variability subspace from non-parallel noisy data, the adaptation equations, and the full derivation of how the subspace produces an utterance-specific NMF dictionary transform are presented in detail in Sections 2 and 3 of the manuscript. These sections explicitly show the no-parallel-data training path and how the adapted dictionaries yield the reported features.

    Revision: no

  2. Referee: [Abstract] Results are labeled “preliminary” with no mention of the number of utterances, cross-validation folds, statistical significance tests, or error bars on the WER figures. Without these, it is impossible to evaluate whether the claim that the proposed features give “the most similar word error rate to clean speech” on unseen noises is reliable.

    Authors: We agree that the abstract would benefit from additional experimental context. The manuscript body reports results on the standard Aurora 4 training and test sets (with the specific number of utterances and the DEMAND noise mixing procedure) using conventional train/test partitions. We will revise the abstract to reference the corpus scale and note that further statistical analysis (including error bars) can be included in a revision.

    Revision: partial

Circularity Check

0 steps flagged

No significant circularity


The provided abstract and description contain no equations, fitted parameters, or derivation steps that reduce to their own inputs by construction. The central claim is an empirical observation on Aurora 4 + DEMAND (comparable WER to baselines, closest to clean on unseen noise) obtained from an utterance-specific transform learned via total variability modeling on NMF dictionaries. No self-definitional loop, fitted-input-as-prediction, or load-bearing self-citation chain is exhibited; the method is explicitly positioned as avoiding parallel clean-noisy data, and the reported metric directly tests the modeling goal without internal reduction to the input assumptions. The derivation is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; the method implicitly assumes that total variability subspaces exist and can be estimated from noisy data alone.

pith-pipeline@v0.9.0 · 5685 in / 1023 out tokens · 18398 ms · 2026-05-24T20:51:45.287753+00:00 · methodology

