Multiple Sound Source Localization with SVD-PHAT

Francois Grondin; James Glass

arxiv: 1906.11913 · v1 · pith:BUPRKHTMnew · submitted 2019-06-27 · 📡 eess.AS · eess.SP

Pith reviewed 2026-05-25 13:42 UTC · model grok-4.3

keywords sound source localization · SVD-PHAT · SRP-PHAT · multiple sources · phase transform · singular value decomposition · acoustic array processing · real-time localization

The pith

SVD-PHAT localizes multiple sound sources more accurately than discrete SRP-PHAT, reducing the root mean square error by up to 0.0395 radians.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper presents a modification of the phase transform that uses singular value decomposition, called SVD-PHAT, to handle multiple simultaneous sound sources. The method performs repeated scans across the search space and projects each low-dimensional observation onto orthogonal subspaces to achieve source separation. It keeps overall computational cost low enough to support real-time operation. The work shows concrete gains over discrete SRP-PHAT in localization accuracy. A reader would care because reliable multi-source direction finding supports practical systems such as robot audition or conference audio without added hardware.

Core claim

The paper claims that SVD-PHAT localizes multiple sound sources more accurately than discrete SRP-PHAT. It achieves this by running multiple scans of the search space and projecting each low-dimensional observation onto orthogonal subspaces. The result is a reduction in root mean square error of up to 0.0395 radians while preserving low algorithm complexity suitable for real-time use.

What carries the argument

SVD-PHAT, the phase transform modified via singular value decomposition, which performs source separation by repeated scans combined with projection of observations onto orthogonal subspaces.
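
The scan-and-project idea can be illustrated with a generic greedy sketch. This is not the authors' implementation: the candidate vectors, the correlation score, and the names `inner`, `project_out`, and `scan_and_project` are assumptions made for illustration, and the paper applies its projections to low-dimensional observations obtained through the SVD, which this toy example does not reproduce.

```python
import math

def inner(u, v):
    """Hermitian inner product <u, v> = sum(conj(u_k) * v_k)."""
    return sum(a.conjugate() * b for a, b in zip(u, v))

def project_out(z, basis):
    """Subtract from z its component along each orthonormal vector in basis."""
    for b in basis:
        c = inner(b, z)
        z = [zk - c * bk for zk, bk in zip(z, b)]
    return z

def scan_and_project(z, candidates, n_scans):
    """Return the indices of the candidates picked over greedy scans."""
    basis, found = [], []
    residual = list(z)
    for _ in range(min(n_scans, len(candidates))):
        # Scan: correlate every not-yet-picked candidate with the residual.
        scores = [-1.0 if i in found else abs(inner(c, residual))
                  for i, c in enumerate(candidates)]
        best = max(range(len(candidates)), key=scores.__getitem__)
        # Gram-Schmidt step: make the new vector orthogonal to earlier ones.
        v = project_out(list(candidates[best]), basis)
        norm = math.sqrt(inner(v, v).real)
        if norm < 1e-12:
            break  # candidate already lies in the span of earlier picks
        found.append(best)
        basis.append([vk / norm for vk in v])
        # Deflate: remove the new direction from the residual.
        residual = project_out(residual, basis[-1:])
    return found

# Toy example (made-up numbers): candidates 0 and 2 are active.
candidates = [[1.0, 0.0, 0.0], [0.6, 0.8, 0.0], [0.0, 0.0, 1.0]]
z = [1.0, 0.0, 0.6]
print(scan_and_project(z, candidates, n_scans=2))  # prints [0, 2]
```

In the toy example the first scan picks candidate 0, and only after its direction has been removed does candidate 2 stand out against candidate 1, which overlaps with candidate 0.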

If this is right

Multiple sources can be localized simultaneously with measurable error reduction compared with discrete SRP-PHAT.
The algorithm stays computationally light enough for real-time deployment on modest hardware.
Orthogonal subspace projections during repeated scans provide the separation mechanism without explicit source counting.
Accuracy gains hold across the tested conditions while complexity remains comparable to the baseline method.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same scan-and-project pattern could be tested on moving sources if scan speed is increased.
Similar subspace projections might transfer to other array-based sensing tasks such as radar direction finding.
Performance under varying room acoustics or with different array geometries remains open for direct measurement.

Load-bearing premise

The method will separate multiple sources effectively when low-dimensional observations are projected onto orthogonal subspaces during repeated scans, without losing real-time feasibility.

What would settle it

Run a controlled test with two or more simultaneous sound sources using the same microphone array; if SVD-PHAT does not produce a lower root mean square error than discrete SRP-PHAT, or if sources remain unresolved, the accuracy claim does not hold.
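
The comparison described above reduces to computing an RMSE in radians for each method on the same trials. A minimal sketch, assuming each estimate has already been paired with its true direction; how the paper matches estimates to sources is not stated in this page, and the helper names `wrap_angle` and `rmse_radians` are hypothetical.

```python
import math

def wrap_angle(a):
    """Wrap an angle in radians to the interval [-pi, pi]."""
    return math.atan2(math.sin(a), math.cos(a))

def rmse_radians(estimates, truths):
    """Root mean square of the wrapped angular errors, in radians."""
    errors = [wrap_angle(e - t) for e, t in zip(estimates, truths)]
    return math.sqrt(sum(x * x for x in errors) / len(errors))

# Ground-truth directions taken from the Figure 1 caption (radians).
truths = [-1.2192, -0.4335, 0.4015]
# Estimates below are made-up placeholders, not results from the paper.
method_a = [-1.20, -0.30, 0.41]
method_b = [-1.21, -0.42, 0.40]
print(rmse_radians(method_a, truths), rmse_radians(method_b, truths))
```

For a 1-D array the wrapping step has no effect on angles this small; it is kept so the same function can be reused when directions span a full circle.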

Figures

Figures reproduced from arXiv: 1906.11913 by Francois Grondin, James Glass.

**Figure 1.** Estimated DOAs obtained with SRP-PHAT and SVD-PHAT for a 1-D array with three speech sources located at −1.2192 rad, −0.4335 rad and 0.4015 rad, and a reverberation time (RT60) of 238 ms. In this example, SRP-PHAT fails to detect the source at −0.4335 rad at several points in time, whereas SVD-PHAT detects this source most of the time. The RMSEs of SRP-PHAT and SVD-PHAT correspond to 0.300…

Original abstract

This paper introduces a modification of phase transform on singular value decomposition (SVD-PHAT) to localize multiple sound sources. This work aims to improve localization accuracy and keeps the algorithm complexity low for real-time applications. This method relies on multiple scans of the search space, with projection of each low-dimensional observation onto orthogonal subspaces. We show that this method localizes multiple sound sources more accurately than discrete SRP-PHAT, with a reduction in the Root Mean Square Error up to 0.0395 radians.
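
The phase transform named in the abstract normalizes each cross-spectrum bin to unit magnitude, so that only phase (delay) information remains. A minimal sketch, assuming the common convention of multiplying one microphone's spectrum by the conjugate of the other's; the function name, the `eps` guard, and the conjugation order are assumptions of this sketch rather than details taken from the paper.

```python
def phat_cross_spectrum(Xi, Xj, eps=1e-12):
    """Phase-normalized cross-spectrum of two STFT frames, one value per bin."""
    out = []
    for a, b in zip(Xi, Xj):
        c = a * b.conjugate()            # cross-spectrum for this bin
        out.append(c / (abs(c) + eps))   # keep the phase, drop the magnitude
    return out

# Made-up spectra for three frequency bins; the last bin of Xi is empty.
Xi = [2 + 0j, 1j, 0j]
Xj = [1 + 0j, 1 + 0j, 3 + 0j]
print([round(abs(v), 6) for v in phat_cross_spectrum(Xi, Xj)])  # prints [1.0, 1.0, 0.0]
```

The small `eps` avoids a division by zero in empty bins, at the cost of a negligible bias in the magnitude of the other bins.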

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

SVD-PHAT adds a subspace projection step to SRP-PHAT for multi-source cases and reports a small RMSE drop, but the supporting math and experiments are not visible in the abstract.

SVD-PHAT modifies the standard phase transform by running multiple scans and projecting low-dimensional observations onto successive orthogonal subspaces derived from SVD. The claim is that this separates sources more cleanly than plain discrete SRP-PHAT while keeping the cost low enough for real time, with an RMSE reduction reaching 0.0395 radians. That is the core new piece: the explicit use of SVD subspaces to peel sources one by one instead of a single joint search.

The paper does well by keeping the description short and by tying the change directly to the practical constraint of real-time operation. The algorithmic outline is easy to follow if you already know SRP-PHAT.

The soft spot is the missing support for the separation claim. No derivation or inequality is given showing under what separation, SNR, or reverberation conditions the projected residual contains energy only from the remaining sources. The abstract also gives the accuracy number without any experimental setup, dataset description, number of trials, or error bars, so the result cannot be checked. If the full text supplies those details and a clear condition for when the projections succeed, the contribution becomes easier to evaluate; otherwise the accuracy statement rests on an unshown step.

This is a paper for audio engineers or robotics groups already running SRP-PHAT who need a modest accuracy lift without extra compute. A reader who works on real-time localization will see the practical angle right away. It deserves a serious referee because the idea is concrete and the target application is active, even though the current evidence is thin.

Referee Report

2 major / 1 minor

Summary. The paper introduces SVD-PHAT, a modification of the phase transform using singular value decomposition, for localizing multiple sound sources. The approach performs multiple scans of the search space and projects low-dimensional observations onto successive orthogonal subspaces. It claims this yields more accurate localization than discrete SRP-PHAT, with an RMSE reduction of up to 0.0395 radians, while preserving low complexity suitable for real-time applications.

Significance. If the separation conditions and accuracy gains are rigorously validated, the method could provide a computationally efficient alternative for real-time multi-source localization in audio signal processing applications. The emphasis on maintaining real-time feasibility is a positive aspect, though the absence of supporting derivations and experimental details currently limits the assessed impact.

major comments (2)

[Abstract] The central claim of an RMSE reduction 'up to 0.0395 radians' versus discrete SRP-PHAT is presented without any description of the experimental setup, datasets, number of sources, SNR/reverberation conditions, number of trials, or error bars. This information is load-bearing for the accuracy improvement assertion.
[Method] The procedure of repeated scans combined with projection onto orthogonal subspaces is asserted to separate multiple sources while preserving real-time complexity, but no derivation, theorem, inequality, or explicit conditions (e.g., minimum angular separation, SNR regime) are supplied showing when the projected residual contains energy only from remaining sources without crosstalk. This is load-bearing for the multi-source localization claim.
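
One way to make the second request concrete is an idealized two-source calculation. It is an illustrative assumption, not a derivation taken from the paper: the observation $z$ is modeled as noise-free, $u_1$ and $u_2$ are hypothetical unit-norm vectors associated with the two sources, and the first scan is assumed to recover $u_1$ exactly.

```latex
\begin{align}
z &= a_1 u_1 + a_2 u_2, \qquad \rho = u_1^H u_2, \\
r &= z - (u_1^H z)\, u_1 = a_2 \,(u_2 - \rho\, u_1), \\
\|r\|^2 &= |a_2|^2 \,\bigl(1 - |\rho|^2\bigr).
\end{align}
```

Under these assumptions the residual $r$ carries no energy from the first source, while the second source is attenuated by the factor $1 - |\rho|^2$. The closer the two vectors are (that is, as $|\rho|$ approaches 1), the less energy is left for the second scan. How $\rho$ depends on angular separation, array geometry, and reverberation is the kind of condition the comment asks the authors to state.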

minor comments (1)

[Abstract] The sentence 'This work aims to improve localization accuracy and keeps the algorithm complexity low' has a parallelism issue; 'keeps' should be 'keep' to match the infinitive 'to improve'.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments point by point below, indicating where revisions will be made to the manuscript.

Referee: [Abstract] The central claim of an RMSE reduction 'up to 0.0395 radians' versus discrete SRP-PHAT is presented without any description of the experimental setup, datasets, number of sources, SNR/reverberation conditions, number of trials, or error bars. This information is load-bearing for the accuracy improvement assertion.

Authors: We agree that the abstract would be strengthened by including brief context for the reported RMSE reduction. In the revised manuscript we will expand the abstract to mention the number of sources tested, the range of SNR and reverberation conditions, the number of trials, and that error bars were computed across trials. revision: yes

Referee: [Method] The procedure of repeated scans combined with projection onto orthogonal subspaces is asserted to separate multiple sources while preserving real-time complexity, but no derivation, theorem, inequality, or explicit conditions (e.g., minimum angular separation, SNR regime) are supplied showing when the projected residual contains energy only from remaining sources without crosstalk. This is load-bearing for the multi-source localization claim.

Authors: The manuscript describes the algorithmic steps but does not supply the requested formal derivation or separation conditions. We will add a short theoretical subsection that derives the orthogonality property of the successive projections and states the minimum angular separation and SNR regime under which crosstalk is provably negligible. revision: yes

Circularity Check

0 steps flagged

No circularity; algorithmic modification presented without reduction to inputs

The paper describes SVD-PHAT as a direct modification of phase transform using multiple scans and orthogonal subspace projections on low-dimensional observations. The central claim is an empirical RMSE reduction versus discrete SRP-PHAT. No equations, parameters, or steps are shown that reduce by construction to fitted inputs, self-definitions, or self-citation chains. The method is self-contained as an algorithmic change with reported experimental comparison; absence of a separation theorem is a completeness issue, not circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; assessment is limited to the high-level claim.

pith-pipeline@v0.9.0 · 5600 in / 953 out tokens · 29232 ms · 2026-05-25T13:42:10.847186+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · relation: unclear

Paper passage: "This method relies on multiple scans of the search space, with projection of each low-dimensional observation onto orthogonal subspaces."

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · relation: unclear

Paper passage: "The Gram-Schmidt process then makes the current vector v_r at scan r orthogonal to all the vectors previously found"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

42 extracted references · 42 canonical work pages

[1] Introduction. The cocktail party effect consists of the ability to focus on a specific conversation in a noisy environment. While humans can usually perform this task efficiently, distant speech processing remains challenging for automatic speech recognition (ASR) systems [1]. To improve ASR performances, it is common to use a beamformer with multiple ...

[2] SRP-PHAT. We first introduce SRP-PHAT with rounded TDOA that allows efficient localization of multiple sound sources with arbitrary array shapes. Let $X_m^l[k] \in \mathbb{C}$ be the Short Time Fourier Transform (STFT) coefficients, where $N \in \mathbb{N}$ and $\Delta N \in \mathbb{N}$ stand for the frame and hop sizes in samples, respectively, and $k \in \{0, 1, \dots, N/2\}$, $m \in \mathcal{M} = \{1, 2, \dots, M\}$ and ...

[3] SVD-PHAT. To define the SVD-PHAT method, it is convenient to start from SRP-PHAT in matrix form. Let us define the vector $X_{i,j} \in \mathbb{C}^{(N/2+1) \times 1}$ for the microphone pair $(i, j) \in P$ that holds the phase normalized cross-correlation coefficients for all bins $k \in \{0, 1, \dots, N/2\}$: $X_{i,j} = [\hat{X}_{i,j}[0] \;\; \hat{X}_{i,j}[1] \;\; \cdots \;\; \hat{X}_{i,j}[N/2]]^T$ (8), where $\{\dots\}^T$ stands for the tr...

[4] Results. We investigate three different microphone array geometries: a 1-D linear array, a 2-D planar array and a 3-D array. The microphones xyz-positions with respect to the center of the array are given in cm in Table 1. Simulations are conducted to measure the accuracy of the proposed method and compare it to the SRP-PHAT approach discretized with ...

[5] Conclusion. This paper extends SVD-PHAT for multiple sound source localization. This technique outperforms the discrete SRP-PHAT approach in terms of accuracy, while preserving the low complexity of the original SVD-PHAT. On average, the reduction in the RMSE varies between 0.0244 and 0.0395 radians, and the best improvement is observed for an array...
[6] H. Tang, W.-N. Hsu, F. Grondin, and J. Glass, “A study of enhancement, augmentation, and autoencoder methods for domain adaptation in distant speech recognition,” in Proc. INTERSPEECH, 2018, pp. 2928–2932.

[7] J. Heymann, L. Drude, A. Chinaev, and R. Haeb-Umbach, “BLSTM supported GEV beamformer front-end for the 3rd CHiME challenge,” in Proc. IEEE ASRU, 2015, pp. 444–451.

[8] B.-K. Lee and J. Jeong, “Deep Neural Network-based Speech Separation Combining with MVDR Beamformer for Automatic Speech Recognition System,” in Proc. IEEE ICCE, 2019, pp. 1–4.

[9] X. Sun, Z. Wang, R. Xia, J. Li, and Y. Yan, “Effect of steering vector estimation on MVDR beamformer for noisy speech recognition,” in Proc. IEEE DSP, 2018, pp. 1–5.

[10] E. Habets, J. Benesty, I. Cohen, S. Gannot, and J. Dmochowski, “New insights into the MVDR beamformer in room acoustics,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 1, p. 158, 2010.

[11] L. C. Parra and C. V. Alvino, “Geometric source separation: Merging convolutive source separation with geometric beamforming,” IEEE Transactions on Speech and Audio Processing, vol. 10, no. 6, pp. 352–362, 2002.

[12] J.-M. Valin, J. Rouat, and F. Michaud, “Enhanced robot audition based on microphone array source separation with post-filter,” in Proc. IEEE/RSJ IROS, vol. 3, 2004, pp. 2123–2128.

[13] R. Schmidt, “Multiple emitter location and signal parameter estimation,” IEEE Transactions on Antennas and Propagation, vol. 34, no. 3, pp. 276–280, 1986.

[14] R. Roy, A. Paulraj, and T. Kailath, “Estimation of signal parameters via rotational invariance techniques - ESPRIT,” in Proc. IEEE MILCOM, 1986.

[15] C. Ishi, O. Chatot, H. Ishiguro, and N. Hagita, “Evaluation of a MUSIC-based real-time sound localization of multiple sound sources in real noisy environments,” in Proc. IEEE/RSJ IROS, 2009, pp. 2027–2032.

[16] K. Nakamura, K. Nakadai, F. Asano, and G. Ince, “Intelligent sound source localization and its application to multimodal human tracking,” in Proc. IEEE/RSJ IROS, 2011, pp. 143–148.

[17] K. Nakamura, K. Nakadai, and H. Okuno, “A real-time super resolution robot audition system that improves the robustness of simultaneous speech recognition,” Advanced Robotics, vol. 27, no. 12, pp. 933–945, 2013.

[18] H. Teutsch and W. Kellermann, “EB-ESPRIT: 2D localization of multiple wideband acoustic sources using eigen-beams,” in Proc. ICASSP, 2005, pp. 89–92.

[19] S. Argentieri and P. Danès, “Broadband variations of the MUSIC high-resolution method for sound source localization in robotics,” in Proc. IEEE/RSJ IROS, 2007, pp. 2009–2014.

[20] P. Danès and J. Bonnal, “Information-theoretic detection of broadband sources in a coherent beamspace MUSIC scheme,” in Proc. IEEE/RSJ IROS, 2010, pp. 1976–1981.

[21] J. DiBiase, H. Silverman, and M. Brandstein, “Robust localization in reverberant rooms,” in Microphone Arrays. Springer, 2001, pp. 157–180.

[22] F. Grondin, D. Létourneau, F. Ferland, V. Rousseau, and F. Michaud, “The ManyEars open framework,” Autonomous Robots, vol. 34, no. 3, pp. 217–232, 2013.

[23] J.-M. Valin, F. Michaud, and J. Rouat, “Robust localization and tracking of simultaneous moving sound sources using beamforming and particle filtering,” Robotics and Autonomous Systems, vol. 55, no. 3, pp. 216–228, 2007.

[24] J.-M. Valin, F. Michaud, B. Hadjou, and J. Rouat, “Localization of simultaneous moving sound source for mobile robot using a frequency-domain steered beamformer approach,” in Proc. IEEE ICRA, 2004, pp. 1033–1038.

[25] J.-M. Valin, F. Michaud, and J. Rouat, “Robust 3D localization and tracking of sound sources using beamforming and particle filtering,” in Proc. IEEE ICASSP, 2006, pp. 841–844.

[26] F. Grondin and F. Michaud, “Lightweight and optimized sound source localization and tracking methods for open and closed microphone array configurations,” Robotics and Autonomous Systems, vol. 113, pp. 63–80, 2019.

[27] D. Yook, T. Lee, and Y. Cho, “Fast sound source localization using two-level search space clustering,” IEEE Transactions on Cybernetics, vol. 46, no. 1, pp. 20–26, 2016.

[28] W. Cai, X. Zhao, and Z. Wu, “Localization of multiple speech sources based on sub-band steered response power,” in Proc. IEEE ICECE, 2010, pp. 1246–1249.

[29] M. Kepesi, L. Ottowitz, and T. Habib, “Joint position-pitch estimation for multiple speaker scenarios,” in Proc. IEEE HSCMA, 2008, pp. 85–88.

[30] D. Pavlidi, S. Delikaris-Manias, V. Pulkki, and A. Mouchtaris, “3D localization of multiple sound sources with intensity vector estimates in single source zones,” in Proc. IEEE EUSIPCO, 2015, pp. 1556–1560.

[31] H. Teutsch and W. Kellermann, “Detection and localization of multiple wideband acoustic sources based on wavefield decomposition using spherical apertures,” in Proc. IEEE ICASSP, 2008, pp. 5276–5279.

[32] C. Evers, A. Moore, and P. Naylor, “Multiple source localisation in the spherical harmonic domain,” in Proc. IWAENC, 2014, pp. 258–262.

[33] S. Hafezi, A. Moore, and P. Naylor, “Multiple source localization in the spherical harmonic domain using augmented intensity vectors based on grid search,” in Proc. IEEE EUSIPCO, 2016, pp. 602–606.

[34] H. Sun, H. Teutsch, E. Mabande, and W. Kellermann, “Robust localization of multiple sources in reverberant environments using EB-ESPRIT with spherical microphone arrays,” in Proc. IEEE ICASSP, 2011, pp. 117–120.

[35] O. Nadiri and B. Rafaely, “Localization of multiple speakers under high reverberation using a spherical microphone array and the direct-path dominance test,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, no. 10, pp. 1494–1505, 2014.

[36] D. Pavlidi, A. Griffin, M. Puigt, and A. Mouchtaris, “Real-time multiple sound source localization and counting using a circular microphone array,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 21, no. 10, pp. 2193–2206, 2013.

[37] H. Teutsch and W. Kellermann, “Acoustic source detection and localization based on wavefield decomposition using circular microphone arrays,” J. Acoust. Soc. Am., vol. 120, no. 5, pp. 2724–2736, 2006.

[38] F. Grondin and J. Glass, “SVD-PHAT: A fast sound source localization method,” in Proc. IEEE ICASSP, 2019.

[39] J. Allen and D. Berkley, “Image method for efficiently simulating small-room acoustics,” The Journal of the Acoustical Society of America, vol. 65, no. 4, pp. 943–950, 1979.

[40] V. Zue, S. Seneff, and J. Glass, “Speech database development at MIT: TIMIT and beyond,” Speech Communication, vol. 9, no. 4, pp. 351–356, 1990.

[41] S. Berchtold, C. Böhm, and H.-P. Kriegel, “The pyramid-technique: towards breaking the curse of dimensionality,” in Proc. ACM SIGMOD Record, vol. 27, no. 2, 1998, pp. 142–153.

[42] M. Thomas, J. Ahrens, and I. Tashev, “Optimal 3D beamforming using measured microphone directivity patterns,” in Proc. IWAENC, 2012, pp. 1–4.

[1] [1]

While humans can usually perform this task efﬁciently, distant speech pr o- cessing remains challenging for automatic speech recognit ion (ASR) systems [1]

Introduction The cocktail party effect consists of the ability to focus on a speciﬁc conversation in a noisy environment. While humans can usually perform this task efﬁciently, distant speech pr o- cessing remains challenging for automatic speech recognit ion (ASR) systems [1]. To improve ASR performances, it is com- mon to use a beamformer with multiple ...

work page

[2] [2]

Let X l m[k] ∈ C be the Short Time Fourier Transform (STFT) coefﬁcients, where N ∈ N and ∆N ∈ N stand for the frame and hop sizes in samples, respectively, a nd k∈{ 0, 1,

SRP-PHAT We ﬁrst introduce SRP-PHA T with rounded TDOA that allows efﬁcient localization of multiple sound sources with arbit rary array shapes. Let X l m[k] ∈ C be the Short Time Fourier Transform (STFT) coefﬁcients, where N ∈ N and ∆N ∈ N stand for the frame and hop sizes in samples, respectively, a nd k∈{ 0, 1, . . . , N/2}, m∈M ={1, 2, . . . , M} and ...

work page

[3] [3]

Let us deﬁne the vector Xi,j ∈ C(N/2+1)× 1 for the microphone pair (i, j)∈P that holds the phase normalized cross-correlation coefﬁcients for all bi ns k∈ {0, 1,

SVD-PHAT To deﬁne the SVD-PHA T method, it is convenient to start from SRP-PHA T in matrix form. Let us deﬁne the vector Xi,j ∈ C(N/2+1)× 1 for the microphone pair (i, j)∈P that holds the phase normalized cross-correlation coefﬁcients for all bi ns k∈ {0, 1, . . . , N/2}: Xi,j = [ ˆXi,j [0] ˆXi,j [1] ··· ˆXi,j [N/2] ] T (8) where{. . .}T stands for the tr...

work page

[4] [4]

The micro - phones xyz-positions with respect to the center of the array are given in cm in Table 1

RESULTS We investigate three different microphone array geometrie s: a 1-D linear array, a 2-D planar array and a 3-D array. The micro - phones xyz-positions with respect to the center of the array are given in cm in Table 1. Simulations are conducted to measure the accuracy of the proposed method and compare it to the SRP-PHA T approach discretized with ...

work page 2027

[5] [5]

This technique outperforms the discrete SRP-P HA T approach in terms of accuracy, while preserving the low com- plexity of the original SVD-PHA T

CONCLUSION This paper extends SVD-PHA T for multiple sound source lo- calization. This technique outperforms the discrete SRP-P HA T approach in terms of accuracy, while preserving the low com- plexity of the original SVD-PHA T. On average, the reductionin the RMSE varies between 0.0244 and 0.0395 radians, and the best improvement is observed for an array...

work page

[6] [6]

A study of enhancement, augmentation, and autoencoder methods for do - main adaptation in distant speech recognition,

H. Tang, W.-N. Hsu, F. Grondin, and J. Glass, “A study of enhancement, augmentation, and autoencoder methods for do - main adaptation in distant speech recognition,” in Proc. INTER- SPEECH, 2018, pp. 2928–2932

work page 2018

[7] [7]

BLSTM supported GEV beamformer front-end for the 3rd CHiME challenge,

J. Heymann, L. Drude, A. Chinaev, and R. Haeb-Umbach, “BLSTM supported GEV beamformer front-end for the 3rd CHiME challenge,” in Proc. IEEE ASRU, 2015, pp. 444–451

work page 2015

[8] [8]

Deep Neural Network-based Speec h Separation Combining with MVDR Beamformer for Automatic Speech Recognition System,

B.-K. Lee and J. Jeong, “Deep Neural Network-based Speec h Separation Combining with MVDR Beamformer for Automatic Speech Recognition System,” in Proc. IEEE ICCE , 2019, pp. 1– 4

work page 2019

[9] [9]

Effect of steeri ng vector estimation on MVDR beamformer for noisy speech recog - nition,

X. Sun, Z. Wang, R. Xia, J. Li, and Y . Y an, “Effect of steeri ng vector estimation on MVDR beamformer for noisy speech recog - nition,” in Proc. IEEE DSP, 2018, pp. 1–5

work page 2018

[10] [10]

New insights into the MVDR beamformer in room acoustics,

E. Habets, J. Benesty, I. Cohen, S. Gannot, and J. Dmochow ski, “New insights into the MVDR beamformer in room acoustics,” IEEE Transactions on Audio, Speech, and Language Processin g, vol. 18, no. 1, p. 158, 2010

work page 2010

[11] [11]

Geometric source separatio n: Merging convolutive source separation with geometric beamform- ing,

L. C. Parra and C. V . Alvino, “Geometric source separatio n: Merging convolutive source separation with geometric beamform- ing,” IEEE Transactions on Speech and Audio Processing, vol. 10, no. 6, pp. 352–362, 2002

work page 2002

[12] [12]

Enhanced robot au dition based on microphone array source separation with post-ﬁlte r,

J.-M. V alin, J. Rouat, and F. Michaud, “Enhanced robot au dition based on microphone array source separation with post-ﬁlte r,” in Proc. IEEE/RSJ IROS, vol. 3, 2004, pp. 2123–2128

work page 2004

[13] [13]

Multiple emitter location and signal param eter estimation,

R. Schmidt, “Multiple emitter location and signal param eter estimation,” IEEE Transactions on Antennas and Propagation , vol. 34, no. 3, pp. 276–280, 1986

work page 1986

[14] [14]

Estimation of signal parame- ters via rotational invariance techniques - ESPRIT,

R. Roy, A. Paulraj, and T. Kailath, “Estimation of signal parame- ters via rotational invariance techniques - ESPRIT,” in Proc. IEEE MILCOM, 1986

work page 1986

[15] [15]

Evaluation of a MUSIC-based real-time sound localization of multiple sound sources in real noisy environments,

C. Ishi, O. Chatot, H. Ishiguro, and N. Hagita, “Evaluation of a MUSIC-based real-time sound localization of multiple sound sources in real noisy environments,” in Proc. IEEE/RSJ IROS, 2009, pp. 2027–2032

work page 2009

[16] [16]

Intelligent sound source localization and its application to multimodal human tracking,

K. Nakamura, K. Nakadai, F. Asano, and G. Ince, “Intelligent sound source localization and its application to multimodal human tracking,” in Proc. IEEE/RSJ IROS, 2011, pp. 143–148

work page 2011

[17] [17]

A real-time super resolution robot audition system that improves the robustness of simultaneous speech recognition,

K. Nakamura, K. Nakadai, and H. Okuno, “A real-time super resolution robot audition system that improves the robustness of simultaneous speech recognition,” Advanced Robotics, vol. 27, no. 12, pp. 933–945, 2013

work page 2013

[18] [18]

EB-ESPRIT: 2D localization of multiple wideband acoustic sources using eigen-beams,

H. Teutsch and W. Kellermann, “EB-ESPRIT: 2D localization of multiple wideband acoustic sources using eigen-beams,” in Proc. ICASSP, 2005, pp. 89–92

work page 2005

[19] [19]

Broadband variations of the MUSIC high-resolution method for sound source localization in robotics,

S. Argentieri and P. Danès, “Broadband variations of the MUSIC high-resolution method for sound source localization in robotics,” in Proc. IEEE/RSJ IROS, 2007, pp. 2009–2014

work page 2007

[20] [20]

Information-theoretic detection of broadband sources in a coherent beamspace MUSIC scheme,

P. Danès and J. Bonnal, “Information-theoretic detection of broadband sources in a coherent beamspace MUSIC scheme,” in Proc. IEEE/RSJ IROS, 2010, pp. 1976–1981

work page 2010

[21] [21]

Robust localization in reverberant rooms,

J. DiBiase, H. Silverman, and M. Brandstein, “Robust localization in reverberant rooms,” in Microphone Arrays. Springer, 2001, pp. 157–180

work page 2001

[22] [22]

The ManyEars open framework,

F. Grondin, D. Létourneau, F. Ferland, V. Rousseau, and F. Michaud, “The ManyEars open framework,” Autonomous Robots, vol. 34, no. 3, pp. 217–232, 2013

work page 2013

[23] [23]

Robust localization and tracking of simultaneous moving sound sources using beamforming and particle filtering,

J.-M. Valin, F. Michaud, and J. Rouat, “Robust localization and tracking of simultaneous moving sound sources using beamforming and particle filtering,” Robotics and Autonomous Systems, vol. 55, no. 3, pp. 216–228, 2007

work page 2007

[24] [24]

Localization of simultaneous moving sound source for mobile robot using a frequency-domain steered beamformer approach,

J.-M. Valin, F. Michaud, B. Hadjou, and J. Rouat, “Localization of simultaneous moving sound source for mobile robot using a frequency-domain steered beamformer approach,” in Proc. IEEE ICRA, 2004, pp. 1033–1038

work page 2004

[25] [25]

Robust 3D localization and tracking of sound sources using beamforming and particle filtering,

J.-M. Valin, F. Michaud, and J. Rouat, “Robust 3D localization and tracking of sound sources using beamforming and particle filtering,” in Proc. IEEE ICASSP, 2006, pp. 841–844

work page 2006

[26] [26]

Lightweight and optimized sound source localization and tracking methods for open and closed microphone array configurations,

F. Grondin and F. Michaud, “Lightweight and optimized sound source localization and tracking methods for open and closed microphone array configurations,” Robotics and Autonomous Systems, vol. 113, pp. 63–80, 2019

work page 2019

[27] [27]

Fast sound source localization using two-level search space clustering,

D. Yook, T. Lee, and Y. Cho, “Fast sound source localization using two-level search space clustering,” IEEE Transactions on Cybernetics, vol. 46, no. 1, pp. 20–26, 2016

work page 2016

[28] [28]

Localization of multiple speech sources based on sub-band steered response power,

W. Cai, X. Zhao, and Z. Wu, “Localization of multiple speech sources based on sub-band steered response power,” in Proc. IEEE ICECE, 2010, pp. 1246–1249

work page 2010

[29] [29]

Joint position-pitch estimation for multiple speaker scenarios,

M. Kepesi, L. Ottowitz, and T. Habib, “Joint position-pitch estimation for multiple speaker scenarios,” in Proc. IEEE HSCMA, 2008, pp. 85–88

work page 2008

[30] [30]

3D localization of multiple sound sources with intensity vector estimates in single source zones,

D. Pavlidi, S. Delikaris-Manias, V. Pulkki, and A. Mouchtaris, “3D localization of multiple sound sources with intensity vector estimates in single source zones,” in Proc. IEEE EUSIPCO, 2015, pp. 1556–1560

work page 2015

[31] [31]

Detection and localization of multiple wideband acoustic sources based on wavefield decomposition using spherical apertures,

H. Teutsch and W. Kellermann, “Detection and localization of multiple wideband acoustic sources based on wavefield decomposition using spherical apertures,” in Proc. IEEE ICASSP, 2008, pp. 5276–5279

work page 2008

[32] [32]

Multiple source localisation in the spherical harmonic domain,

C. Evers, A. Moore, and P. Naylor, “Multiple source localisation in the spherical harmonic domain,” in Proc. IWAENC, 2014, pp. 258–262

work page 2014

[33] [33]

Multiple source localization in the spherical harmonic domain using augmented intensity vectors based on grid search,

S. Hafezi, A. Moore, and P. Naylor, “Multiple source localization in the spherical harmonic domain using augmented intensity vectors based on grid search,” in Proc. IEEE EUSIPCO, 2016, pp. 602–606

work page 2016

[34] [34]

Robust localization of multiple sources in reverberant environments using EB-ESPRIT with spherical microphone arrays,

H. Sun, H. Teutsch, E. Mabande, and W. Kellermann, “Robust localization of multiple sources in reverberant environments using EB-ESPRIT with spherical microphone arrays,” in Proc. IEEE ICASSP, 2011, pp. 117–120

work page 2011

[35] [35]

Localization of multiple speakers under high reverberation using a spherical microphone array and the direct-path dominance test,

O. Nadiri and B. Rafaely, “Localization of multiple speakers under high reverberation using a spherical microphone array and the direct-path dominance test,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, no. 10, pp. 1494–1505, 2014

work page 2014

[36] [36]

Real-time multiple sound source localization and counting using a circular microphone array,

D. Pavlidi, A. Griffin, M. Puigt, and A. Mouchtaris, “Real-time multiple sound source localization and counting using a circular microphone array,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 21, no. 10, pp. 2193–2206, 2013

work page 2013

[37] [37]

Acoustic source detection and localization based on wavefield decomposition using circular microphone arrays,

H. Teutsch and W. Kellermann, “Acoustic source detection and localization based on wavefield decomposition using circular microphone arrays,” J. Acoust. Soc. Am., vol. 120, no. 5, pp. 2724–2736, 2006

work page 2006

[38] [38]

SVD-PHAT: A fast sound source localization method,

F. Grondin and J. Glass, “SVD-PHAT: A fast sound source localization method,” in Proc. IEEE ICASSP, 2019

work page 2019

[39] [39]

Image method for efficiently simulating small-room acoustics,

J. Allen and D. Berkley, “Image method for efficiently simulating small-room acoustics,” The Journal of the Acoustical Society of America, vol. 65, no. 4, pp. 943–950, 1979

work page 1979

[40] [40]

Speech database development at MIT: TIMIT and beyond,

V. Zue, S. Seneff, and J. Glass, “Speech database development at MIT: TIMIT and beyond,” Speech Communication, vol. 9, no. 4, pp. 351–356, 1990

work page 1990

[41] [41]

The pyramid-technique: towards breaking the curse of dimensionality,

S. Berchtold, C. Böhm, and H.-P. Kriegel, “The pyramid-technique: towards breaking the curse of dimensionality,” in Proc. ACM SIGMOD Record, vol. 27, no. 2, 1998, pp. 142–153

work page 1998

[42] [42]

Optimal 3D beamforming using measured microphone directivity patterns,

M. Thomas, J. Ahrens, and I. Tashev, “Optimal 3D beamforming using measured microphone directivity patterns,” in Proc. IWAENC, 2012, pp. 1–4

work page 2012