Multiple Sound Source Localization with SVD-PHAT
Pith reviewed 2026-05-25 13:42 UTC · model grok-4.3
The pith
SVD-PHAT localizes multiple sound sources more accurately than discrete SRP-PHAT, reducing root mean square error by up to 0.0395 radians.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that SVD-PHAT localizes multiple sound sources more accurately than discrete SRP-PHAT. It achieves this by running multiple scans of the search space and projecting each low-dimensional observation onto orthogonal subspaces. The result is a reduction in root mean square error of up to 0.0395 radians while preserving low algorithm complexity suitable for real-time use.
What carries the argument
SVD-PHAT, the phase transform modified via singular value decomposition, which performs source separation by repeated scans combined with projection of observations onto orthogonal subspaces.
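The scan-and-project idea can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes each candidate direction has a known signature vector in the low-dimensional space (the `candidates` list is hypothetical), picks the best match per scan, and uses Gram-Schmidt to remove the chosen direction from the observation before the next scan. The names `inner`, `project_out`, and `scan_and_project` are invented for this example.

```python
import math

def inner(u, v):
    """Hermitian inner product <u, v> = sum(conj(u_k) * v_k)."""
    return sum(a.conjugate() * b for a, b in zip(u, v))

def normalize(v):
    """Scale v to unit length (assumes v is not the zero vector)."""
    norm = math.sqrt(inner(v, v).real)
    return [x / norm for x in v]

def project_out(z, basis):
    """Remove from z its components along each orthonormal vector in basis."""
    residual = list(z)
    for q in basis:
        coeff = inner(q, residual)
        residual = [r - coeff * qk for r, qk in zip(residual, q)]
    return residual

def scan_and_project(z, candidates, num_scans):
    """Hypothetical multi-scan loop: one source index per scan.

    Assumes num_scans does not exceed the number of linearly
    independent candidates; otherwise normalize() would divide by zero.
    """
    basis, found = [], []
    residual = list(z)
    for _ in range(num_scans):
        # Scan: pick the candidate that correlates most with the residual.
        best = max(range(len(candidates)),
                   key=lambda i: abs(inner(candidates[i], residual)))
        found.append(best)
        # Gram-Schmidt: make the chosen vector orthogonal to earlier ones.
        basis.append(normalize(project_out(candidates[best], basis)))
        # Deflate the observation so the next scan sees the remaining energy.
        residual = project_out(z, basis)
    return found, residual

# Toy example with made-up, non-orthogonal signatures.
candidates = [[1.0, 0.0], [0.6, 0.8]]
z = [3.6, 0.8]  # equals 3 * candidates[0] + 1 * candidates[1]
found, residual = scan_and_project(z, candidates, num_scans=2)
print(found, residual)
```

In the toy example the stronger source is found on the first scan and the weaker one on the second, after its competitor has been projected out. Whether such deflation avoids crosstalk in realistic conditions is exactly the question the referee report raises.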
If this is right
- Multiple sources can be localized simultaneously with measurable error reduction compared with discrete SRP-PHAT.
- The algorithm stays computationally light enough for real-time deployment on modest hardware.
- Orthogonal subspace projections during repeated scans provide the separation mechanism without explicit source counting.
- Accuracy gains hold across the tested conditions while complexity remains comparable to the baseline method.
Where Pith is reading between the lines
- The same scan-and-project pattern could be tested on moving sources if scan speed is increased.
- Similar subspace projections might transfer to other array-based sensing tasks such as radar direction finding.
- Performance under varying room acoustics or with different array geometries remains open for direct measurement.
Load-bearing premise
The method will separate multiple sources effectively when low-dimensional observations are projected onto orthogonal subspaces during repeated scans, without losing real-time feasibility.
What would settle it
Run a controlled test with two or more simultaneous sound sources using the same microphone array; if SVD-PHAT does not produce a lower root mean square error than discrete SRP-PHAT, or if sources remain unresolved, the accuracy claim does not hold.
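Such a comparison reduces to computing an RMSE per method over the same trials. The sketch below assumes a single azimuth angle per source in radians and wraps differences to the shortest arc; the paper's exact error definition is not given on this page, and the numbers are made-up placeholders, not results.

```python
import math

def angular_error(estimate, truth):
    """Smallest absolute difference between two angles, in radians."""
    diff = (estimate - truth + math.pi) % (2 * math.pi) - math.pi
    return abs(diff)

def rmse(estimates, truths):
    """Root mean square of the wrapped angular errors."""
    errors = [angular_error(e, t) for e, t in zip(estimates, truths)]
    return math.sqrt(sum(err ** 2 for err in errors) / len(errors))

# Placeholder values for illustration only (not taken from the paper).
truths = [0.50, 2.00]
baseline_estimates = [0.56, 1.93]   # stands in for discrete SRP-PHAT
proposed_estimates = [0.52, 1.98]   # stands in for SVD-PHAT

reduction = rmse(baseline_estimates, truths) - rmse(proposed_estimates, truths)
print(f"RMSE reduction: {reduction:.4f} rad")
```

A positive reduction favors the proposed method on that set of trials; the claim would need this to hold across array geometries and source counts, with the spread across trials reported.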
Original abstract
This paper introduces a modification of phase transform on singular value decomposition (SVD-PHAT) to localize multiple sound sources. This work aims to improve localization accuracy and keeps the algorithm complexity low for real-time applications. This method relies on multiple scans of the search space, with projection of each low-dimensional observation onto orthogonal subspaces. We show that this method localizes multiple sound sources more accurately than discrete SRP-PHAT, with a reduction in the Root Mean Square Error up to 0.0395 radians.
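For readers unfamiliar with the phase transform that both SRP-PHAT and SVD-PHAT build on, the following minimal sketch shows the normalization step for one microphone pair. It assumes the common convention of multiplying one spectrum by the conjugate of the other; the function name and the `eps` guard are illustrative choices, not taken from the paper.

```python
import cmath

def phat_cross_spectrum(X_i, X_j, eps=1e-12):
    """Phase-normalized cross-spectrum of two STFT frames, one value per bin.

    Each output keeps only the phase difference between the two
    microphones; the magnitude is normalized to (approximately) one.
    """
    out = []
    for a, b in zip(X_i, X_j):
        cross = a * b.conjugate()
        out.append(cross / (abs(cross) + eps))  # eps avoids division by zero
    return out

# Two bins with different magnitudes but known phase offsets.
X_i = [cmath.rect(2.0, 0.3), cmath.rect(5.0, 1.0)]
X_j = [cmath.rect(4.0, 0.1), cmath.rect(0.5, -0.2)]
normalized = phat_cross_spectrum(X_i, X_j)
print([round(cmath.phase(v), 3) for v in normalized])
```

Discarding magnitude makes every frequency bin contribute equally to the delay estimate, which is why the result depends only on the inter-microphone phase differences.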
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces SVD-PHAT, a modification of the phase transform using singular value decomposition, for localizing multiple sound sources. The approach performs multiple scans of the search space and projects low-dimensional observations onto successive orthogonal subspaces. It claims this yields more accurate localization than discrete SRP-PHAT, with an RMSE reduction of up to 0.0395 radians, while preserving low complexity suitable for real-time applications.
Significance. If the separation conditions and accuracy gains are rigorously validated, the method could provide a computationally efficient alternative for real-time multi-source localization in audio signal processing applications. The emphasis on maintaining real-time feasibility is a positive aspect, though the absence of supporting derivations and experimental details currently limits the assessed impact.
major comments (2)
- [Abstract] Abstract: The central claim of an RMSE reduction 'up to 0.0395 radians' versus discrete SRP-PHAT is presented without any description of the experimental setup, datasets, number of sources, SNR/reverberation conditions, number of trials, or error bars. This information is load-bearing for the accuracy improvement assertion.
- [Method] Method description: The procedure of repeated scans combined with projection onto orthogonal subspaces is asserted to separate multiple sources while preserving real-time complexity, but no derivation, theorem, inequality, or explicit conditions (e.g., minimum angular separation, SNR regime) are supplied showing when the projected residual contains energy only from remaining sources without crosstalk. This is load-bearing for the multi-source localization claim.
minor comments (1)
- [Abstract] Abstract: The sentence 'This work aims to improve localization accuracy and keeps the algorithm complexity low' has a parallelism issue; 'keeps' should be 'keep' to match the infinitive 'to improve'.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address the two major comments point by point below, indicating where revisions will be made to the manuscript.
-
Referee: [Abstract] Abstract: The central claim of an RMSE reduction 'up to 0.0395 radians' versus discrete SRP-PHAT is presented without any description of the experimental setup, datasets, number of sources, SNR/reverberation conditions, number of trials, or error bars. This information is load-bearing for the accuracy improvement assertion.
Authors: We agree that the abstract would be strengthened by including brief context for the reported RMSE reduction. In the revised manuscript we will expand the abstract to mention the number of sources tested, the range of SNR and reverberation conditions, the number of trials, and that error bars were computed across trials. revision: yes
-
Referee: [Method] Method description: The procedure of repeated scans combined with projection onto orthogonal subspaces is asserted to separate multiple sources while preserving real-time complexity, but no derivation, theorem, inequality, or explicit conditions (e.g., minimum angular separation, SNR regime) are supplied showing when the projected residual contains energy only from remaining sources without crosstalk. This is load-bearing for the multi-source localization claim.
Authors: The manuscript describes the algorithmic steps but does not supply the requested formal derivation or separation conditions. We will add a short theoretical subsection that derives the orthogonality property of the successive projections and states the minimum angular separation and SNR regime under which crosstalk is provably negligible. revision: yes
Circularity Check
No circularity; algorithmic modification presented without reduction to inputs
The paper describes SVD-PHAT as a direct modification of phase transform using multiple scans and orthogonal subspace projections on low-dimensional observations. The central claim is an empirical RMSE reduction versus discrete SRP-PHAT. No equations, parameters, or steps are shown that reduce by construction to fitted inputs, self-definitions, or self-citation chains. The method is self-contained as an algorithmic change with reported experimental comparison; absence of a separation theorem is a completeness issue, not circularity.
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean: reality_from_one_distinction (tag: unclear)
Paper passage:
This method relies on multiple scans of the search space, with projection of each low-dimensional observation onto orthogonal subspaces.
-
IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel (tag: unclear)
Paper passage:
The Gram-Schmidt process then makes the current vector v_r at scan r orthogonal to all the vectors previously found
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Introduction. The cocktail party effect consists of the ability to focus on a specific conversation in a noisy environment. While humans can usually perform this task efficiently, distant speech processing remains challenging for automatic speech recognition (ASR) systems [1]. To improve ASR performances, it is common to use a beamformer with multiple ...
-
[2]
SRP-PHAT. We first introduce SRP-PHAT with rounded TDOA that allows efficient localization of multiple sound sources with arbitrary array shapes. Let $X_m^l[k] \in \mathbb{C}$ be the Short Time Fourier Transform (STFT) coefficients, where $N \in \mathbb{N}$ and $\Delta N \in \mathbb{N}$ stand for the frame and hop sizes in samples, respectively, and $k \in \{0, 1, \dots, N/2\}$, $m \in \mathcal{M} = \{1, 2, \dots, M\}$ and ...
-
[3]
SVD-PHAT. To define the SVD-PHAT method, it is convenient to start from SRP-PHAT in matrix form. Let us define the vector $X_{i,j} \in \mathbb{C}^{(N/2+1) \times 1}$ for the microphone pair $(i, j) \in \mathcal{P}$ that holds the phase normalized cross-correlation coefficients for all bins $k \in \{0, 1, \dots, N/2\}$: $X_{i,j} = [\hat{X}_{i,j}[0] \;\; \hat{X}_{i,j}[1] \;\; \cdots \;\; \hat{X}_{i,j}[N/2]]^T$ (8), where $\{\dots\}^T$ stands for the tr...
-
[4]
The microphones xyz-positions with respect to the center of the array are given in cm in Table 1
RESULTS. We investigate three different microphone array geometries: a 1-D linear array, a 2-D planar array and a 3-D array. The microphones xyz-positions with respect to the center of the array are given in cm in Table 1. Simulations are conducted to measure the accuracy of the proposed method and compare it to the SRP-PHAT approach discretized with ...
-
[5]
CONCLUSION. This paper extends SVD-PHAT for multiple sound source localization. This technique outperforms the discrete SRP-PHAT approach in terms of accuracy, while preserving the low complexity of the original SVD-PHAT. On average, the reduction in the RMSE varies between 0.0244 and 0.0395 radians, and the best improvement is observed for an array...
-
[6]
H. Tang, W.-N. Hsu, F. Grondin, and J. Glass, “A study of enhancement, augmentation, and autoencoder methods for domain adaptation in distant speech recognition,” in Proc. INTERSPEECH, 2018, pp. 2928–2932
-
[7]
BLSTM supported GEV beamformer front-end for the 3rd CHiME challenge,
J. Heymann, L. Drude, A. Chinaev, and R. Haeb-Umbach, “BLSTM supported GEV beamformer front-end for the 3rd CHiME challenge,” in Proc. IEEE ASRU, 2015, pp. 444–451
-
[8]
B.-K. Lee and J. Jeong, “Deep Neural Network-based Speech Separation Combining with MVDR Beamformer for Automatic Speech Recognition System,” in Proc. IEEE ICCE, 2019, pp. 1–4
-
[9]
Effect of steering vector estimation on MVDR beamformer for noisy speech recognition,
X. Sun, Z. Wang, R. Xia, J. Li, and Y. Yan, “Effect of steering vector estimation on MVDR beamformer for noisy speech recognition,” in Proc. IEEE DSP, 2018, pp. 1–5
-
[10]
New insights into the MVDR beamformer in room acoustics,
E. Habets, J. Benesty, I. Cohen, S. Gannot, and J. Dmochowski, “New insights into the MVDR beamformer in room acoustics,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 1, p. 158, 2010
-
[11]
Geometric source separation: Merging convolutive source separation with geometric beamforming,
L. C. Parra and C. V. Alvino, “Geometric source separation: Merging convolutive source separation with geometric beamforming,” IEEE Transactions on Speech and Audio Processing, vol. 10, no. 6, pp. 352–362, 2002
-
[12]
Enhanced robot audition based on microphone array source separation with post-filter,
J.-M. Valin, J. Rouat, and F. Michaud, “Enhanced robot audition based on microphone array source separation with post-filter,” in Proc. IEEE/RSJ IROS, vol. 3, 2004, pp. 2123–2128
-
[13]
Multiple emitter location and signal parameter estimation,
R. Schmidt, “Multiple emitter location and signal parameter estimation,” IEEE Transactions on Antennas and Propagation, vol. 34, no. 3, pp. 276–280, 1986
-
[14]
Estimation of signal parameters via rotational invariance techniques - ESPRIT,
R. Roy, A. Paulraj, and T. Kailath, “Estimation of signal parameters via rotational invariance techniques - ESPRIT,” in Proc. IEEE MILCOM, 1986
-
[15]
C. Ishi, O. Chatot, H. Ishiguro, and N. Hagita, “Evaluation of a MUSIC-based real-time sound localization of multiple sound sources in real noisy environments,” in Proc. IEEE/RSJ IROS, 2009, pp. 2027–2032
-
[16]
Intelligent sound source localization and its application to multimodal human tracking,
K. Nakamura, K. Nakadai, F. Asano, and G. Ince, “Intelligent sound source localization and its application to multimodal human tracking,” in Proc. IEEE/RSJ IROS, 2011, pp. 143–148
-
[17]
K. Nakamura, K. Nakadai, and H. Okuno, “A real-time super resolution robot audition system that improves the robustness of simultaneous speech recognition,” Advanced Robotics, vol. 27, no. 12, pp. 933–945, 2013
-
[18]
EB-ESPRIT: 2D localization of multiple wideband acoustic sources using eigen-beams,
H. Teutsch and W. Kellermann, “EB-ESPRIT: 2D localization of multiple wideband acoustic sources using eigen-beams,” in Proc. ICASSP, 2005, pp. 89–92
-
[19]
S. Argentieri and P. Danès, “Broadband variations of the MUSIC high-resolution method for sound source localization in robotics,” in Proc. IEEE/RSJ IROS, 2007, pp. 2009–2014
-
[20]
Information-theoretic detection of broad- band sources in a coherent beamspace MUSIC scheme,
P. Danès and J. Bonnal, “Information-theoretic detection of broadband sources in a coherent beamspace MUSIC scheme,” in Proc. IEEE/RSJ IROS, 2010, pp. 1976–1981
-
[21]
Robust localization in reverberant rooms,
J. DiBiase, H. Silverman, and M. Brandstein, “Robust localization in reverberant rooms,” in Microphone Arrays. Springer, 2001, pp. 157–180
-
[22]
F. Grondin, D. Létourneau, F. Ferland, V. Rousseau, and F. Michaud, “The ManyEars open framework,” Autonomous Robots, vol. 34, no. 3, pp. 217–232, 2013
-
[23]
J.-M. Valin, F. Michaud, and J. Rouat, “Robust localization and tracking of simultaneous moving sound sources using beamforming and particle filtering,” Robotics and Autonomous Systems, vol. 55, no. 3, pp. 216–228, 2007
-
[24]
J.-M. Valin, F. Michaud, B. Hadjou, and J. Rouat, “Localization of simultaneous moving sound source for mobile robot using a frequency-domain steered beamformer approach,” in Proc. IEEE ICRA, 2004, pp. 1033–1038
-
[25]
Robust 3D localization and tracking of sound sources using beamforming and particle filtering,
J.-M. Valin, F. Michaud, and J. Rouat, “Robust 3D localization and tracking of sound sources using beamforming and particle filtering,” in Proc. IEEE ICASSP, 2006, pp. 841–844
-
[26]
F. Grondin and F. Michaud, “Lightweight and optimized sound source localization and tracking methods for open and closed microphone array configurations,” Robotics and Autonomous Systems, vol. 113, pp. 63–80, 2019
-
[27]
Fast sound source localization using two-level search space clustering,
D. Yook, T. Lee, and Y. Cho, “Fast sound source localization using two-level search space clustering,” IEEE Transactions on Cybernetics, vol. 46, no. 1, pp. 20–26, 2016
-
[28]
Localization of multiple speech sources based on sub-band steered response power,
W. Cai, X. Zhao, and Z. Wu, “Localization of multiple speech sources based on sub-band steered response power,” in Proc. IEEE ICECE, 2010, pp. 1246–1249
-
[29]
Joint position-pitch estimation for multiple speaker scenarios,
M. Kepesi, L. Ottowitz, and T. Habib, “Joint position-pitch estimation for multiple speaker scenarios,” in Proc. IEEE HSCMA, 2008, pp. 85–88
-
[30]
3D localization of multiple sound sources with intensity vector estimates in single source zones,
D. Pavlidi, S. Delikaris-Manias, V. Pulkki, and A. Mouchtaris, “3D localization of multiple sound sources with intensity vector estimates in single source zones,” in Proc. IEEE EUSIPCO, 2015, pp. 1556–1560
-
[31]
H. Teutsch and W. Kellermann, “Detection and localization of multiple wideband acoustic sources based on wavefield decomposition using spherical apertures,” in Proc. IEEE ICASSP, 2008, pp. 5276–5279
-
[32]
Multiple source localisation in the spherical harmonic domain,
C. Evers, A. Moore, and P. Naylor, “Multiple source localisation in the spherical harmonic domain,” in Proc. IWAENC, 2014, pp. 258–262
-
[33]
S. Hafezi, A. Moore, and P. Naylor, “Multiple source localization in the spherical harmonic domain using augmented intensity vectors based on grid search,” in Proc. IEEE EUSIPCO, 2016, pp. 602–606
-
[34]
H. Sun, H. Teutsch, E. Mabande, and W. Kellermann, “Robust localization of multiple sources in reverberant environments using EB-ESPRIT with spherical microphone arrays,” in Proc. IEEE ICASSP, 2011, pp. 117–120
-
[35]
O. Nadiri and B. Rafaely, “Localization of multiple speakers under high reverberation using a spherical microphone array and the direct-path dominance test,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, no. 10, pp. 1494–1505, 2014
-
[36]
Real-time multiple sound source localization and counting using a circular microphone array,
D. Pavlidi, A. Griffin, M. Puigt, and A. Mouchtaris, “Real-time multiple sound source localization and counting using a circular microphone array,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 21, no. 10, pp. 2193–2206, 2013
-
[37]
H. Teutsch and W. Kellermann, “Acoustic source detection and localization based on wavefield decomposition using circular microphone arrays,” J. Acoust. Soc. Am., vol. 120, no. 5, pp. 2724–2736, 2006
-
[38]
SVD-PHAT: A fast sound source localization method,
F. Grondin and J. Glass, “SVD-PHAT: A fast sound source localization method,” in Proc. IEEE ICASSP, 2019
-
[39]
Image method for efficiently simulating small-room acoustics,
J. Allen and D. Berkley, “Image method for efficiently simulating small-room acoustics,” The Journal of the Acoustical Society of America, vol. 65, no. 4, pp. 943–950, 1979
-
[40]
Speech database development at MIT: TIMIT and beyond,
V. Zue, S. Seneff, and J. Glass, “Speech database development at MIT: TIMIT and beyond,” Speech Communication, vol. 9, no. 4, pp. 351–356, 1990
-
[41]
The pyramid-technique: towards breaking the curse of dimensionality,
S. Berchtold, C. Böhm, and H.-P. Kriegel, “The pyramid-technique: towards breaking the curse of dimensionality,” in Proc. ACM SIGMOD Record, vol. 27, no. 2, 1998, pp. 142–153
-
[42]
Optimal 3D beamforming using measured microphone directivity patterns,
M. Thomas, J. Ahrens, and I. Tashev, “Optimal 3D beamforming using measured microphone directivity patterns,” in Proc. IWAENC, 2012, pp. 1–4