arxiv: 1906.11913 · v1 · pith:BUPRKHTMnew · submitted 2019-06-27 · 📡 eess.AS · eess.SP

Multiple Sound Source Localization with SVD-PHAT

Pith reviewed 2026-05-25 13:42 UTC · model grok-4.3

classification 📡 eess.AS eess.SP
keywords sound source localization · SVD-PHAT · SRP-PHAT · multiple sources · phase transform · singular value decomposition · acoustic array processing · real-time localization

The pith

SVD-PHAT localizes multiple sound sources more accurately than discrete SRP-PHAT, reducing the root mean square error by up to 0.0395 radians.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper presents a modification of the phase transform that uses singular value decomposition, called SVD-PHAT, to handle multiple simultaneous sound sources. The method performs repeated scans across the search space and projects each low-dimensional observation onto orthogonal subspaces to achieve source separation. It keeps overall computational cost low enough to support real-time operation. The work shows concrete gains over discrete SRP-PHAT in localization accuracy. A reader would care because reliable multi-source direction finding supports practical systems such as robot audition or conference audio without added hardware.

Core claim

The paper claims that SVD-PHAT localizes multiple sound sources more accurately than discrete SRP-PHAT. It achieves this by running multiple scans of the search space and projecting each low-dimensional observation onto orthogonal subspaces. The result is a reduction in root mean square error of up to 0.0395 radians while preserving low algorithm complexity suitable for real-time use.

What carries the argument

SVD-PHAT, the phase transform modified via singular value decomposition, which performs source separation by repeated scans combined with projection of observations onto orthogonal subspaces.
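
The scan-and-project idea can be illustrated with a toy greedy deflation, close in spirit to matching pursuit. This is a minimal sketch under stated assumptions, not the paper's algorithm: the dictionary `D` (uniform-linear-array style steering vectors), the sizes `K` and `Q`, the source indices and the function name `scan_and_project` are all hypothetical stand-ins for the paper's low-dimensional observation space.

```python
import numpy as np

def scan_and_project(z, D, n_sources):
    """Greedy multi-source search: scan, pick the peak, project it out, repeat."""
    residual = z.astype(complex)
    found = []
    for _ in range(n_sources):
        # Scan: correlate the residual with every candidate direction.
        scores = np.abs(D.conj() @ residual)
        q = int(np.argmax(scores))
        found.append(q)
        # Project the residual onto the subspace orthogonal to the winner.
        d = D[q]                                   # rows of D have unit norm
        residual = residual - np.vdot(d, residual) * d
    return found

# Hypothetical dictionary: Q candidate directions, K-dimensional observations.
K, Q = 16, 32
k = np.arange(K)
u = -1.0 + 2.0 * np.arange(Q) / Q                  # candidate grid in sin(theta)
D = np.exp(1j * np.pi * np.outer(u, k)) / np.sqrt(K)

# Two sources on the grid: a strong one and a weaker one.
q1, q2 = 8, 22
z = 1.0 * D[q1] + 0.5 * D[q2]

found = scan_and_project(z, D, n_sources=2)
single_scan_top2 = np.argsort(np.abs(D.conj() @ z))[-2:].tolist()
print(found, single_scan_top2)
```

In this toy setup the two largest peaks of a single scan are the strong source and one of its grid neighbours, so the weaker source is missed, whereas the second scan after projection recovers it. The two chosen directions are exactly orthogonal here, which is the most favourable case for deflation.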

If this is right

  • Multiple sources can be localized simultaneously with measurable error reduction compared with discrete SRP-PHAT.
  • The algorithm stays computationally light enough for real-time deployment on modest hardware.
  • Orthogonal subspace projections during repeated scans provide the separation mechanism without explicit source counting.
  • Accuracy gains hold across the tested conditions while complexity remains comparable to the baseline method.
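
The complexity argument can be made concrete with a small sketch. It assumes that the low-dimensional observation comes from a truncated SVD of a dense scan matrix, which the excerpts on this page only hint at. The frame size `N`, the delay grid `taus`, the rank `K` and the single-pair matrix `W` are hypothetical choices for illustration.

```python
import numpy as np

N = 78                                  # hypothetical frame size: N/2 + 1 = 40 bins
bins = np.arange(N // 2 + 1)
taus = np.linspace(-4.0, 4.0, 64)       # hypothetical candidate delays (samples)

# Dense scan matrix: one row per candidate delay, one column per frequency bin.
W = np.exp(2j * np.pi * np.outer(taus, bins) / N)

# Truncated SVD, W ~ U_K diag(s_K) V_K^H, with an arbitrarily chosen rank K.
U, s, Vh = np.linalg.svd(W, full_matrices=False)
K = 8
US_K, Vh_K = U[:, :K] * s[:K], Vh[:K]

# Phase-only observation for a source at a delay of 1.5 samples.
X = np.exp(-2j * np.pi * bins * 1.5 / N)

y_full = W @ X          # full scan: Q * M complex multiplications
z = Vh_K @ X            # low-dimensional observation: K * M multiplications
y_low = US_K @ z        # approximate scan: Q * K multiplications

err = np.linalg.norm(y_full - y_low)
bound = s[K] * np.linalg.norm(X)        # spectral-norm bound on truncation error
cost_full = W.size
cost_low = K * (W.shape[0] + W.shape[1])
print(err, bound, cost_full, cost_low)
```

The approximation error is bounded by the first discarded singular value times the norm of the observation, so the rank trades accuracy against multiplication count. Whether a given rank preserves the peak location depends on the array and grid, and is not checked here.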

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same scan-and-project pattern could be tested on moving sources if scan speed is increased.
  • Similar subspace projections might transfer to other array-based sensing tasks such as radar direction finding.
  • Performance under varying room acoustics or with different array geometries remains open for direct measurement.

Load-bearing premise

The method will separate multiple sources effectively when low-dimensional observations are projected onto orthogonal subspaces during repeated scans, without losing real-time feasibility.

What would settle it

Run a controlled test with two or more simultaneous sound sources using the same microphone array; if SVD-PHAT does not produce a lower root mean square error than discrete SRP-PHAT, or if sources remain unresolved, the accuracy claim does not hold.
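
A minimal sketch of the comparison metric follows. The true directions are those quoted in the Figure 1 caption; the per-method estimates are made-up numbers, and nearest-neighbour matching of estimates to true sources is an assumption of this sketch, since the paper's matching rule is not stated on this page.

```python
import numpy as np

def doa_rmse(estimated, truth):
    """RMSE in radians, matching each estimate to its nearest true DOA."""
    estimated = np.asarray(estimated, dtype=float)
    truth = np.asarray(truth, dtype=float)
    # Absolute error between every estimate and every true direction.
    errors = np.abs(estimated[:, None] - truth[None, :])
    nearest = errors.min(axis=1)        # assumed matching rule: nearest source
    return float(np.sqrt(np.mean(nearest ** 2)))

# True DOAs from the Figure 1 caption (1-D array, three speech sources).
truth = [-1.2192, -0.4335, 0.4015]

# Hypothetical estimates from two methods (illustrative values only).
method_a = [-1.20, -0.45, 0.41, 0.39]
method_b = [-1.10, 0.45, 0.30, -1.30]

rmse_a = doa_rmse(method_a, truth)
rmse_b = doa_rmse(method_b, truth)
print(f"method A: {rmse_a:.4f} rad, method B: {rmse_b:.4f} rad")
```

Nearest-source matching does not penalize a source that is never detected, so a full test would also need to report missed detections alongside the RMSE.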

Figures

Figures reproduced from arXiv: 1906.11913 by Francois Grondin, James Glass.

Figure 1. Estimated DOAs obtained with SRP-PHAT and SVD-PHAT for a 1-D array with three speech sources located at −1.2192 rad, −0.4335 rad and 0.4015 rad, and a reverberation time (RT60) of 238 ms. In this example, the SRP-PHAT method fails to detect the source at −0.4335 rad at different times, whereas SVD-PHAT detects this source most of the time. The RMSEs of SRP-PHAT and SVD-PHAT correspond to 0.300…
original abstract

This paper introduces a modification of phase transform on singular value decomposition (SVD-PHAT) to localize multiple sound sources. This work aims to improve localization accuracy and keeps the algorithm complexity low for real-time applications. This method relies on multiple scans of the search space, with projection of each low-dimensional observation onto orthogonal subspaces. We show that this method localizes multiple sound sources more accurately than discrete SRP-PHAT, with a reduction in the Root Mean Square Error up to 0.0395 radians.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces SVD-PHAT, a modification of the phase transform using singular value decomposition, for localizing multiple sound sources. The approach performs multiple scans of the search space and projects low-dimensional observations onto successive orthogonal subspaces. It claims this yields more accurate localization than discrete SRP-PHAT, with an RMSE reduction of up to 0.0395 radians, while preserving low complexity suitable for real-time applications.

Significance. If the separation conditions and accuracy gains are rigorously validated, the method could provide a computationally efficient alternative for real-time multi-source localization in audio signal processing applications. The emphasis on maintaining real-time feasibility is a positive aspect, though the absence of supporting derivations and experimental details currently limits the assessed impact.

major comments (2)
  1. [Abstract] Abstract: The central claim of an RMSE reduction 'up to 0.0395 radians' versus discrete SRP-PHAT is presented without any description of the experimental setup, datasets, number of sources, SNR/reverberation conditions, number of trials, or error bars. This information is load-bearing for the accuracy improvement assertion.
  2. [Method] Method description: The procedure of repeated scans combined with projection onto orthogonal subspaces is asserted to separate multiple sources while preserving real-time complexity, but no derivation, theorem, inequality, or explicit conditions (e.g., minimum angular separation, SNR regime) are supplied showing when the projected residual contains energy only from remaining sources without crosstalk. This is load-bearing for the multi-source localization claim.
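
The kind of condition requested in major comment 2 can be illustrated with a two-source toy model. This is an assumption of this note, not a derivation from the paper: unit-norm vectors $\mathbf{u}_1$ and $\mathbf{u}_2$ stand for two directions in the low-dimensional space, and the first scan is assumed to identify $\mathbf{u}_1$ exactly.

```latex
\begin{align}
\mathbf{z} &= a_1 \mathbf{u}_1 + a_2 \mathbf{u}_2,
  \qquad \|\mathbf{u}_1\| = \|\mathbf{u}_2\| = 1, \\
\mathbf{r} &= \left(\mathbf{I} - \mathbf{u}_1 \mathbf{u}_1^{H}\right) \mathbf{z}
  = a_2 \left(\mathbf{u}_2 - \rho\, \mathbf{u}_1\right),
  \qquad \rho = \mathbf{u}_1^{H} \mathbf{u}_2, \\
\|\mathbf{r}\|^2 &= |a_2|^2 \left(1 - |\rho|^2\right).
\end{align}
```

Under these assumptions the residual carries no energy from the first source, but the second source keeps only a fraction $1 - |\rho|^2$ of its energy. Closely spaced sources, for which $|\rho|$ approaches 1, are therefore attenuated by the projection, which is why a minimum-separation condition matters.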
minor comments (1)
  1. [Abstract] Abstract: The sentence 'This work aims to improve localization accuracy and keeps the algorithm complexity low' has faulty parallelism; 'keeps' should be 'keep' so that both verbs follow 'aims to'.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments point by point below, indicating where revisions will be made to the manuscript.

point-by-point responses
  1. Referee: [Abstract] Abstract: The central claim of an RMSE reduction 'up to 0.0395 radians' versus discrete SRP-PHAT is presented without any description of the experimental setup, datasets, number of sources, SNR/reverberation conditions, number of trials, or error bars. This information is load-bearing for the accuracy improvement assertion.

    Authors: We agree that the abstract would be strengthened by including brief context for the reported RMSE reduction. In the revised manuscript we will expand the abstract to mention the number of sources tested, the range of SNR and reverberation conditions, the number of trials, and that error bars were computed across trials. revision: yes

  2. Referee: [Method] Method description: The procedure of repeated scans combined with projection onto orthogonal subspaces is asserted to separate multiple sources while preserving real-time complexity, but no derivation, theorem, inequality, or explicit conditions (e.g., minimum angular separation, SNR regime) are supplied showing when the projected residual contains energy only from remaining sources without crosstalk. This is load-bearing for the multi-source localization claim.

    Authors: The manuscript describes the algorithmic steps but does not supply the requested formal derivation or separation conditions. We will add a short theoretical subsection that derives the orthogonality property of the successive projections and states the minimum angular separation and SNR regime under which crosstalk is provably negligible. revision: yes

Circularity Check

0 steps flagged

No circularity; algorithmic modification presented without reduction to inputs

full rationale

The paper describes SVD-PHAT as a direct modification of phase transform using multiple scans and orthogonal subspace projections on low-dimensional observations. The central claim is an empirical RMSE reduction versus discrete SRP-PHAT. No equations, parameters, or steps are shown that reduce by construction to fitted inputs, self-definitions, or self-citation chains. The method is self-contained as an algorithmic change with reported experimental comparison; absence of a separation theorem is a completeness issue, not circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; assessment is limited to the high-level claim.

pith-pipeline@v0.9.0 · 5600 in / 953 out tokens · 29232 ms · 2026-05-25T13:42:10.847186+00:00 · methodology

Reference graph

Works this paper leans on

42 extracted references · 42 canonical work pages

  1. [1]

    While humans can usually perform this task efficiently, distant speech processing remains challenging for automatic speech recognition (ASR) systems [1]

    Introduction. The cocktail party effect consists of the ability to focus on a specific conversation in a noisy environment. While humans can usually perform this task efficiently, distant speech processing remains challenging for automatic speech recognition (ASR) systems [1]. To improve ASR performance, it is common to use a beamformer with multiple ...

  2. [2]

    Let $X_m^l[k] \in \mathbb{C}$ be the Short Time Fourier Transform (STFT) coefficients, where $N \in \mathbb{N}$ and $\Delta N \in \mathbb{N}$ stand for the frame and hop sizes in samples, respectively, and $k \in \{0, 1,$

    SRP-PHAT. We first introduce SRP-PHAT with rounded TDOA that allows efficient localization of multiple sound sources with arbitrary array shapes. Let $X_m^l[k] \in \mathbb{C}$ be the Short Time Fourier Transform (STFT) coefficients, where $N \in \mathbb{N}$ and $\Delta N \in \mathbb{N}$ stand for the frame and hop sizes in samples, respectively, and $k \in \{0, 1, \dots, N/2\}$, $m \in \mathcal{M} = \{1, 2, \dots, M\}$ and ...

  3. [3]

    Let us define the vector $\mathbf{X}_{i,j} \in \mathbb{C}^{(N/2+1) \times 1}$ for the microphone pair $(i, j) \in \mathcal{P}$ that holds the phase normalized cross-correlation coefficients for all bins $k \in \{0, 1,$

    SVD-PHAT. To define the SVD-PHAT method, it is convenient to start from SRP-PHAT in matrix form. Let us define the vector $\mathbf{X}_{i,j} \in \mathbb{C}^{(N/2+1) \times 1}$ for the microphone pair $(i, j) \in \mathcal{P}$ that holds the phase normalized cross-correlation coefficients for all bins $k \in \{0, 1, \dots, N/2\}$: $\mathbf{X}_{i,j} = \left[\, \hat{X}_{i,j}[0] \;\; \hat{X}_{i,j}[1] \;\; \cdots \;\; \hat{X}_{i,j}[N/2] \,\right]^T$ (8), where $\{\dots\}^T$ stands for the tr...

  4. [4]

    The microphones xyz-positions with respect to the center of the array are given in cm in Table 1

    Results. We investigate three different microphone array geometries: a 1-D linear array, a 2-D planar array and a 3-D array. The microphones xyz-positions with respect to the center of the array are given in cm in Table 1. Simulations are conducted to measure the accuracy of the proposed method and compare it to the SRP-PHAT approach discretized with ...

  5. [5]

    This technique outperforms the discrete SRP-PHAT approach in terms of accuracy, while preserving the low complexity of the original SVD-PHAT

    Conclusion. This paper extends SVD-PHAT for multiple sound source localization. This technique outperforms the discrete SRP-PHAT approach in terms of accuracy, while preserving the low complexity of the original SVD-PHAT. On average, the reduction in the RMSE varies between 0.0244 and 0.0395 radians, and the best improvement is observed for an array...

  6. [6]

    A study of enhancement, augmentation, and autoencoder methods for domain adaptation in distant speech recognition,

    H. Tang, W.-N. Hsu, F. Grondin, and J. Glass, “A study of enhancement, augmentation, and autoencoder methods for domain adaptation in distant speech recognition,” in Proc. INTERSPEECH, 2018, pp. 2928–2932

  7. [7]

    BLSTM supported GEV beamformer front-end for the 3rd CHiME challenge,

    J. Heymann, L. Drude, A. Chinaev, and R. Haeb-Umbach, “BLSTM supported GEV beamformer front-end for the 3rd CHiME challenge,” in Proc. IEEE ASRU, 2015, pp. 444–451

  8. [8]

    Deep Neural Network-based Speech Separation Combining with MVDR Beamformer for Automatic Speech Recognition System,

    B.-K. Lee and J. Jeong, “Deep Neural Network-based Speech Separation Combining with MVDR Beamformer for Automatic Speech Recognition System,” in Proc. IEEE ICCE, 2019, pp. 1–4

  9. [9]

    Effect of steering vector estimation on MVDR beamformer for noisy speech recognition,

    X. Sun, Z. Wang, R. Xia, J. Li, and Y. Yan, “Effect of steering vector estimation on MVDR beamformer for noisy speech recognition,” in Proc. IEEE DSP, 2018, pp. 1–5

  10. [10]

    New insights into the MVDR beamformer in room acoustics,

    E. Habets, J. Benesty, I. Cohen, S. Gannot, and J. Dmochowski, “New insights into the MVDR beamformer in room acoustics,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 1, p. 158, 2010

  11. [11]

    Geometric source separation: Merging convolutive source separation with geometric beamforming,

    L. C. Parra and C. V. Alvino, “Geometric source separation: Merging convolutive source separation with geometric beamforming,” IEEE Transactions on Speech and Audio Processing, vol. 10, no. 6, pp. 352–362, 2002

  12. [12]

    Enhanced robot audition based on microphone array source separation with post-filter,

    J.-M. Valin, J. Rouat, and F. Michaud, “Enhanced robot audition based on microphone array source separation with post-filter,” in Proc. IEEE/RSJ IROS, vol. 3, 2004, pp. 2123–2128

  13. [13]

    Multiple emitter location and signal parameter estimation,

    R. Schmidt, “Multiple emitter location and signal parameter estimation,” IEEE Transactions on Antennas and Propagation, vol. 34, no. 3, pp. 276–280, 1986

  14. [14]

    Estimation of signal parameters via rotational invariance techniques - ESPRIT,

    R. Roy, A. Paulraj, and T. Kailath, “Estimation of signal parameters via rotational invariance techniques - ESPRIT,” in Proc. IEEE MILCOM, 1986

  15. [15]

    Evaluation of a MUSIC-based real-time sound localization of multiple sound sources in real noisy environments,

    C. Ishi, O. Chatot, H. Ishiguro, and N. Hagita, “Evaluation of a MUSIC-based real-time sound localization of multiple sound sources in real noisy environments,” in Proc. IEEE/RSJ IROS, 2009, pp. 2027–2032

  16. [16]

    Intelligent sound source localization and its application to multimodal human tracking,

    K. Nakamura, K. Nakadai, F. Asano, and G. Ince, “Intelligent sound source localization and its application to multimodal human tracking,” in Proc. IEEE/RSJ IROS, 2011, pp. 143–148

  17. [17]

    A real-time super resolution robot audition system that improves the robustness of simultaneous speech recognition,

    K. Nakamura, K. Nakadai, and H. Okuno, “A real-time super resolution robot audition system that improves the robustness of simultaneous speech recognition,” Advanced Robotics, vol. 27, no. 12, pp. 933–945, 2013

  18. [18]

    EB-ESPRIT: 2D localization of multiple wideband acoustic sources using eigen-beams,

    H. Teutsch and W. Kellermann, “EB-ESPRIT: 2D localization of multiple wideband acoustic sources using eigen-beams,” in Proc. ICASSP, 2005, pp. 89–92

  19. [19]

    Broadband variations of the MUSIC high-resolution method for sound source localization in robotics,

    S. Argentieri and P. Danès, “Broadband variations of the MUSIC high-resolution method for sound source localization in robotics,” in Proc. IEEE/RSJ IROS, 2007, pp. 2009–2014

  20. [20]

    Information-theoretic detection of broadband sources in a coherent beamspace MUSIC scheme,

    P. Danès and J. Bonnal, “Information-theoretic detection of broadband sources in a coherent beamspace MUSIC scheme,” in Proc. IEEE/RSJ IROS, 2010, pp. 1976–1981

  21. [21]

    Robust localization in reverberant rooms,

    J. DiBiase, H. Silverman, and M. Brandstein, “Robust localization in reverberant rooms,” in Microphone Arrays. Springer, 2001, pp. 157–180

  22. [22]

    The ManyEars open framework,

    F. Grondin, D. Létourneau, F. Ferland, V. Rousseau, and F. Michaud, “The ManyEars open framework,” Autonomous Robots, vol. 34, no. 3, pp. 217–232, 2013

  23. [23]

    Robust localization and tracking of simultaneous moving sound sources using beamforming and particle filtering,

    J.-M. Valin, F. Michaud, and J. Rouat, “Robust localization and tracking of simultaneous moving sound sources using beamforming and particle filtering,” Robotics and Autonomous Systems, vol. 55, no. 3, pp. 216–228, 2007

  24. [24]

    Localization of simultaneous moving sound source for mobile robot using a frequency-domain steered beamformer approach,

    J.-M. Valin, F. Michaud, B. Hadjou, and J. Rouat, “Localization of simultaneous moving sound source for mobile robot using a frequency-domain steered beamformer approach,” in Proc. IEEE ICRA, 2004, pp. 1033–1038

  25. [25]

    Robust 3D localization and tracking of sound sources using beamforming and particle filtering,

    J.-M. Valin, F. Michaud, and J. Rouat, “Robust 3D localization and tracking of sound sources using beamforming and particle filtering,” in Proc. IEEE ICASSP, 2006, pp. 841–844

  26. [26]

    Lightweight and optimized sound source localization and tracking methods for open and closed microphone array configurations,

    F. Grondin and F. Michaud, “Lightweight and optimized sound source localization and tracking methods for open and closed microphone array configurations,” Robotics and Autonomous Systems, vol. 113, pp. 63–80, 2019

  27. [27]

    Fast sound source localization using two-level search space clustering,

    D. Yook, T. Lee, and Y. Cho, “Fast sound source localization using two-level search space clustering,” IEEE Transactions on Cybernetics, vol. 46, no. 1, pp. 20–26, 2016

  28. [28]

    Localization of multiple speech sources based on sub-band steered response power,

    W. Cai, X. Zhao, and Z. Wu, “Localization of multiple speech sources based on sub-band steered response power,” in Proc. IEEE ICECE, 2010, pp. 1246–1249

  29. [29]

    Joint position-pitch estimation for multiple speaker scenarios,

    M. Kepesi, L. Ottowitz, and T. Habib, “Joint position-pitch estimation for multiple speaker scenarios,” in Proc. IEEE HSCMA, 2008, pp. 85–88

  30. [30]

    3D localization of multiple sound sources with intensity vector estimates in single source zones,

    D. Pavlidi, S. Delikaris-Manias, V. Pulkki, and A. Mouchtaris, “3D localization of multiple sound sources with intensity vector estimates in single source zones,” in Proc. IEEE EUSIPCO, 2015, pp. 1556–1560

  31. [31]

    Detection and localization of multiple wideband acoustic sources based on wavefield decomposition using spherical apertures,

    H. Teutsch and W. Kellermann, “Detection and localization of multiple wideband acoustic sources based on wavefield decomposition using spherical apertures,” in Proc. IEEE ICASSP, 2008, pp. 5276–5279

  32. [32]

    Multiple source localisation in the spherical harmonic domain,

    C. Evers, A. Moore, and P. Naylor, “Multiple source localisation in the spherical harmonic domain,” in Proc. IWAENC, 2014, pp. 258–262

  33. [33]

    Multiple source localization in the spherical harmonic domain using augmented intensity vectors based on grid search,

    S. Hafezi, A. Moore, and P. Naylor, “Multiple source localization in the spherical harmonic domain using augmented intensity vectors based on grid search,” in Proc. IEEE EUSIPCO, 2016, pp. 602–606

  34. [34]

    Robust localization of multiple sources in reverberant environments using EB-ESPRIT with spherical microphone arrays,

    H. Sun, H. Teutsch, E. Mabande, and W. Kellermann, “Robust localization of multiple sources in reverberant environments using EB-ESPRIT with spherical microphone arrays,” in Proc. IEEE ICASSP, 2011, pp. 117–120

  35. [35]

    Localization of multiple speakers under high reverberation using a spherical microphone array and the direct-path dominance test,

    O. Nadiri and B. Rafaely, “Localization of multiple speakers under high reverberation using a spherical microphone array and the direct-path dominance test,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, no. 10, pp. 1494–1505, 2014

  36. [36]

    Real-time multiple sound source localization and counting using a circular microphone array,

    D. Pavlidi, A. Griffin, M. Puigt, and A. Mouchtaris, “Real-time multiple sound source localization and counting using a circular microphone array,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 21, no. 10, pp. 2193–2206, 2013

  37. [37]

    Acoustic source detection and localization based on wavefield decomposition using circular microphone arrays,

    H. Teutsch and W. Kellermann, “Acoustic source detection and localization based on wavefield decomposition using circular microphone arrays,” J. Acoust. Soc. Am., vol. 120, no. 5, pp. 2724–2736, 2006

  38. [38]

    SVD-PHAT: A fast sound source localization method,

    F. Grondin and J. Glass, “SVD-PHAT: A fast sound source localization method,” in Proc. IEEE ICASSP, 2019

  39. [39]

    Image method for efficiently simulating small-room acoustics,

    J. Allen and D. Berkley, “Image method for efficiently simulating small-room acoustics,” The Journal of the Acoustical Society of America, vol. 65, no. 4, pp. 943–950, 1979

  40. [40]

    Speech database development at MIT: TIMIT and beyond,

    V. Zue, S. Seneff, and J. Glass, “Speech database development at MIT: TIMIT and beyond,” Speech Communication, vol. 9, no. 4, pp. 351–356, 1990

  41. [41]

    The pyramid-technique: towards breaking the curse of dimensionality,

    S. Berchtold, C. Böhm, and H.-P. Kriegel, “The pyramid-technique: towards breaking the curse of dimensionality,” in Proc. ACM SIGMOD Record, vol. 27, no. 2, 1998, pp. 142–153

  42. [42]

    Optimal 3D beamforming using measured microphone directivity patterns,

    M. Thomas, J. Ahrens, and I. Tashev, “Optimal 3D beamforming using measured microphone directivity patterns,” in Proc. IWAENC, 2012, pp. 1–4