Multi-Speaker DOA Estimation in Binaural Hearing Aids using Deep Learning and Speaker Count Fusion

Farnaz Jazaeri; François Grondin; Homayoun Kamkar-Parsi; Martin Bouchard

arxiv: 2509.21382 · v2 · submitted 2025-09-23 · 📡 eess.AS · cs.SD


Pith reviewed 2026-05-18 14:09 UTC · model grok-4.3


keywords: DOA estimation · binaural hearing aids · multi-speaker · deep learning · source counting · CRNN · late fusion · direction of arrival


The pith

Accurate source count as auxiliary input boosts multi-speaker DOA estimation in binaural hearing aids.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper examines how information about the number of active speakers can aid a deep learning model in estimating directions of arrival for multiple talkers using binaural hearing aid microphones. The authors compare joint training on both DOA and source counting against using source count as an extra feature fed into a CRNN at early, middle, or late stages. Joint training improves counting predictions but leaves DOA performance unchanged, while late fusion of perfect source count data raises average F1-scores by up to 14 percent over the baseline CRNN on real recordings. A sympathetic reader would care because reliable direction estimates let hearing aids better isolate a target voice in noisy, multi-speaker settings.

Core claim

A ground-truth source count used as an auxiliary feature significantly enhances standalone DOA estimation performance, with late fusion yielding up to 14% higher average F1-scores over the baseline CRNN.

What carries the argument

Late fusion of the number of active sources (0, 1, or 2+) as an auxiliary input into the CRNN architecture for multi-source DOA estimation.
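A minimal sketch of what late fusion could look like, under stated assumptions: the count is one-hot encoded over the three classes (0, 1, 2+) and concatenated with the recurrent-layer features just before a sigmoid multi-label output layer. The one-hot encoding, the feature size, the number of DOA classes, and the names `one_hot_count` and `late_fusion_output` are illustrative choices, not details confirmed by the paper.

```python
import math
import random

NUM_COUNT_CLASSES = 3  # classes 0, 1, and 2+ (the grouping named in the paper)


def one_hot_count(num_sources):
    """Map a raw source count to a 3-way one-hot vector (0, 1, 2+).

    One-hot encoding is an assumption; the paper may use another encoding.
    """
    idx = min(max(num_sources, 0), NUM_COUNT_CLASSES - 1)
    return [1.0 if i == idx else 0.0 for i in range(NUM_COUNT_CLASSES)]


def late_fusion_output(recurrent_features, num_sources, weights, biases):
    """Concatenate the count vector with late-stage features, then apply
    a sigmoid output layer that scores each DOA class independently."""
    fused = list(recurrent_features) + one_hot_count(num_sources)
    probs = []
    for w_row, b in zip(weights, biases):
        z = sum(w * x for w, x in zip(w_row, fused)) + b
        probs.append(1.0 / (1.0 + math.exp(-z)))
    return probs


# Hypothetical sizes and random weights, for illustration only.
rng = random.Random(0)
feature_dim, num_doa_classes = 8, 5
features = [rng.uniform(-1, 1) for _ in range(feature_dim)]
weights = [
    [rng.uniform(-0.5, 0.5) for _ in range(feature_dim + NUM_COUNT_CLASSES)]
    for _ in range(num_doa_classes)
]
biases = [0.0] * num_doa_classes

probs = late_fusion_output(features, 3, weights, biases)  # 3 talkers -> "2+"
print([round(p, 3) for p in probs])
```

Early or mid fusion would instead inject the same vector at the input or between the convolutional and recurrent stages; only the concatenation point changes.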

If this is right

Dual-task training benefits source-count prediction but does not improve DOA estimation.
Late fusion of source count outperforms early and mid fusion strategies for DOA accuracy.
Source-count information can be used to make DOA estimation more robust in multi-speaker noisy environments.
Experiments on real binaural recordings confirm the gains from oracle source count.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Pairing the DOA model with a separate high-accuracy source counter could transfer these gains to real devices without oracle inputs.
The results imply that source count supplies useful context that helps the network resolve directional ambiguities among overlapping speakers.
Future tests could measure how much the 14% F1 gain shrinks as source-count error rate increases from zero.

Load-bearing premise

The reported gains assume perfect oracle source-count information is supplied to the DOA network.

What would settle it

Replace the oracle source count with outputs from an imperfect source counter and measure whether the F1-score gains over the baseline CRNN remain statistically significant on the same real binaural recordings.
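Such a comparison needs an F1-score over multi-label DOA outputs. The sketch below assumes a micro-averaged F1 over frames, where each frame is a binary vector marking which DOA classes are active. The paper's exact F1 definition is not given on this page, so the averaging scheme and the name `micro_f1` are assumptions, and the toy frames are placeholders rather than data from the paper.

```python
def micro_f1(reference, predicted):
    """Micro-averaged F1 over frames of binary DOA-class activity.

    reference, predicted: lists of equal-length 0/1 lists, one per frame.
    """
    tp = fp = fn = 0
    for ref_frame, pred_frame in zip(reference, predicted):
        for r, p in zip(ref_frame, pred_frame):
            if r and p:
                tp += 1  # active class correctly detected
            elif p and not r:
                fp += 1  # class predicted but not active
            elif r and not p:
                fn += 1  # active class missed
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Toy placeholder frames: 3 frames, 4 DOA classes.
reference = [[1, 0, 0, 1], [0, 1, 0, 0], [0, 0, 0, 0]]
predicted = [[1, 0, 0, 0], [0, 1, 1, 0], [0, 0, 0, 0]]
score = micro_f1(reference, predicted)
print(round(score, 4))
```

Computing this score per recording for the baseline and for the count-fused model gives paired values that a significance test can then compare.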

Figures

Figures reproduced from arXiv: 2509.21382 by Farnaz Jazaeri, François Grondin, Homayoun Kamkar-Parsi, Martin Bouchard.

Original abstract

For extracting a target speaker voice, direction-of-arrival (DOA) estimation is crucial for binaural hearing aids operating in noisy, multi-speaker environments. Among the solutions developed for this task, a deep learning convolutional recurrent neural network (CRNN) model leveraging spectral phase differences and magnitude ratios between microphone signals is a popular option. In this paper, we explore adding source-count information for multi-sources DOA estimation. The use of dual-task training with joint multi-sources DOA estimation and source counting is first considered. We then consider using the source count as an auxiliary feature in a standalone DOA estimation system, where the number of active sources (0, 1, or 2+) is integrated into the CRNN architecture through early, mid, and late fusion strategies. Experiments using real binaural recordings are performed. Results show that the dual-task training does not improve DOA estimation performance, although it benefits source-count prediction. However, a ground-truth (oracle) source count used as an auxiliary feature significantly enhances standalone DOA estimation performance, with late fusion yielding up to 14% higher average F1-scores over the baseline CRNN. This highlights the potential of using source-count estimation for robust DOA estimation in binaural hearing aids.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Late fusion of oracle source count lifts multi-speaker DOA F1 by up to 14% on real binaural data, but the gains stay untested once count estimates carry realistic errors.


The main takeaway is that feeding perfect source count as a late-fused feature into their binaural CRNN raises average F1 scores for multi-speaker DOA by as much as 14% over the plain baseline. Joint training for count and DOA helps the count task but leaves DOA performance unchanged. They run the tests on real binaural recordings, which keeps the results grounded rather than purely simulated. The comparison across early, mid, and late fusion stages is straightforward and shows late fusion working best in their setup. That part is a clean, incremental extension of existing fusion ideas to this hearing-aid context. The work earns credit for sticking to concrete numbers instead of vague claims.

The clearest limitation is the oracle-count assumption. The reported lift depends on supplying error-free count information, yet the paper does not test what happens when that count comes from an imperfect estimator or when they feed their own dual-task count output back into the DOA branch. That gap matters for actual deployment. The abstract also leaves dataset size, cross-validation scheme, and significance testing unspecified, so the 14% figure is harder to weigh without the full experimental section.

This paper is aimed at people working on deep-learning localization for hearing aids or other wearable audio devices. A reader who already uses CRNN baselines and wants to see how auxiliary count information can be added would pick up usable details from the fusion results.

I would send it to peer review. The empirical comparison is solid enough to deserve referee input on the missing robustness checks and on whether the dataset details support the stated gains.

Referee Report

1 major / 2 minor

Summary. The manuscript explores enhancing multi-speaker DOA estimation for binaural hearing aids by incorporating source-count information (0, 1, or 2+) into a CRNN model that uses spectral phase differences and magnitude ratios. It first evaluates dual-task joint training for DOA and source counting, then examines early, mid, and late fusion of source count as an auxiliary feature in a standalone DOA system. Experiments on real binaural recordings show dual-task training improves counting but not DOA, while oracle source count with late fusion yields up to 14% higher average F1-scores over the baseline CRNN.

Significance. If the central result holds, the work demonstrates that auxiliary source-count information can meaningfully boost DOA performance in multi-speaker settings relevant to hearing aids. Credit is due for the use of real binaural recordings and for reporting concrete F1-score gains. The comparison of fusion stages provides useful architectural insight. However, the practical significance for deployment remains conditional on the untested robustness to source-count estimation errors.

major comments (1)

[Experiments and Results] The reported F1 improvements (up to 14% with late fusion) rely on supplying oracle/ground-truth source counts to the DOA network. No ablation or sensitivity analysis is provided that injects realistic count errors or substitutes the dual-task source-count output as input. This assumption is load-bearing for the claim that source-count fusion enhances standalone DOA estimation in binaural hearing aids, as real systems would use estimated counts whose error propagation is unquantified.

minor comments (2)

[Abstract] The abstract omits dataset size, number of recordings, cross-validation procedure, and any statistical significance tests for the F1 gains; adding these details would strengthen reproducibility claims.
[Methods] Notation for the fusion stages (early/mid/late) and how the source-count feature is encoded (one-hot or embedding) could be clarified with a diagram or explicit equations in the methods section.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and for acknowledging the value of our experiments on real binaural recordings and the architectural comparisons of fusion strategies. We address the major comment on the use of oracle source counts below and will incorporate additional analysis in the revised manuscript.


Referee: The reported F1 improvements (up to 14% with late fusion) rely on supplying oracle/ground-truth source counts to the DOA network. No ablation or sensitivity analysis is provided that injects realistic count errors or substitutes the dual-task source-count output as input. This assumption is load-bearing for the claim that source-count fusion enhances standalone DOA estimation in binaural hearing aids, as real systems would use estimated counts whose error propagation is unquantified.

Authors: We agree that the robustness to source-count estimation errors is an important consideration for practical deployment in hearing aids. Our use of oracle counts was deliberate to quantify the upper-bound improvement achievable when perfect count information is available, thereby isolating the contribution of the fusion mechanism itself. The dual-task joint training was already evaluated in the manuscript; it improved source-count accuracy but produced no gain in DOA F1-scores relative to the baseline CRNN. In the revised manuscript we will add a dedicated sensitivity study: we will (i) inject controlled synthetic errors into the oracle count input (e.g., 10% and 20% random misclassification rates among the three classes) and report the resulting degradation in DOA F1-scores under late fusion, and (ii) substitute the actual output of the dual-task count predictor as the auxiliary input to the DOA network. These results will directly quantify error propagation and strengthen the practical relevance of the findings.

revision: yes
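Step (i) of the proposed sensitivity study could be sketched as follows, assuming each frame's oracle count is independently replaced, with a fixed probability, by one of the other two classes chosen uniformly. The uniform confusion model, the function name `corrupt_counts`, and the toy count sequence are assumptions for illustration; the rebuttal does not specify how errors would be drawn.

```python
import random

COUNT_CLASSES = (0, 1, 2)  # 2 stands for the "2+" class


def corrupt_counts(oracle_counts, error_rate, seed=0):
    """Return a copy of oracle_counts in which each entry is swapped for a
    different class with probability error_rate (uniform over the others)."""
    rng = random.Random(seed)
    noisy = []
    for c in oracle_counts:
        if rng.random() < error_rate:
            noisy.append(rng.choice([k for k in COUNT_CLASSES if k != c]))
        else:
            noisy.append(c)
    return noisy


# Placeholder per-frame oracle counts, not data from the paper.
oracle = [0, 1, 2, 1, 2, 0, 1, 2] * 250

for rate in (0.10, 0.20):
    noisy = corrupt_counts(oracle, rate)
    observed = sum(a != b for a, b in zip(oracle, noisy)) / len(oracle)
    print(f"target error rate {rate:.2f}, observed {observed:.3f}")
```

The corrupted sequence would then replace the oracle count at the fusion input, and the DOA F1-score would be recomputed at each error rate. Real count estimators tend to confuse adjacent classes more often than distant ones, so a class-dependent confusion matrix would be a natural refinement.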

Circularity Check

0 steps flagged

No circularity: empirical gains measured on held-out data


The paper presents standard supervised CRNN training for binaural DOA estimation, with ablation experiments comparing baseline, dual-task, and oracle-source-count fusion variants. Performance is quantified via F1-scores on real held-out binaural recordings; no equations, fitted parameters, or self-citations are shown to reduce the reported 14% F1 improvement to an input by construction. The central result is an empirical observation about oracle auxiliary features, not a derivation that collapses to its own definitions or prior self-citations.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on standard assumptions of deep learning for audio (features capture spatial cues, CRNN can learn temporal dependencies) plus the availability of oracle source count; no new physical entities or ad-hoc constants are introduced.

free parameters (1)

fusion stage
Early, mid, and late fusion points are tested and late is selected for best reported performance.

axioms (1)

domain assumption: Spectral phase differences and magnitude ratios between binaural channels are sufficient spatial features for DOA estimation
Invoked as the input representation for the baseline CRNN.



Reference graph

Works this paper leans on

20 extracted references · 20 canonical work pages

[1]

INTRODUCTION. For hearing-aid users, accurately localizing active speakers is essential for speech intelligibility and situational awareness. Direction-of-arrival (DOA) estimation enables the device to steer beamformers, suppress noise, and enhance conversational cues. However, real-world listening environments such as restaurants and meeting rooms pose ...
[2]

METHODOLOGY 2.1. Problem Formulation. We estimate direction-of-arrival (DOA) from a binaural 2-microphone behind-the-ear (BTE) binaural hearing-aid, using three microphone signals: front/rear microphones on the local device, and front microphone from the opposite-side device. Taking the view of the right-side device in Fig. 1(a), multi-sources DOA esti...
[3]

EXPERIMENTS 3.1. Datasets. Training relied on synthetic mixtures generated by convolving TIMIT speech with head-related impulse responses (HRIRs) from multiple WS Audiology behind-the-ear (BTE) 2-microphone binaural hearing aid devices, measured in both anechoic and reverberant rooms (with RT60 reverberation time from 0.1 to 0.6 sec). Mixtures were produ...
[4]

CONCLUSION. This work evaluated the use of source-count information to potentially improve DOA estimation performance in binaural hearing aids. Dual-task training for both DOA estimation and source counting did not benefit DOA estimation performance, suggesting that multi-label DOA outputs already encodes implicit source count information. However, e...
[5]

The generalized correlation method for estimation of time delay

C. H. Knapp and G. C. Carter, "The generalized correlation method for estimation of time delay," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 24, no. 4, pp. 320-327, Aug. 1976
[6]

M. S. Brandstein and D. B. Ward, Eds., Microphone Arrays: Signal Processing Techniques and Applications, Springer, Berlin, Germany, 2001

[7]

A robust method to count and locate audio sources in a multichannel underdetermined mixture

E. Arberet, R. Gribonval, and F. Bimbot, "A robust method to count and locate audio sources in a multichannel underdetermined mixture," IEEE Transactions on Signal Processing, vol. 58, no. 1, pp. 121-133, Jan. 2010. [4] C.R. Landschoot and N. Xiang, "Model-based bayesian direction of arrival analysis for sound sources using a spherical microphone array...
[8]

Localization of multiple speakers based on a two step acoustic map analysis

A. Brutti, M. Omologo, and P. Svaizer, "Localization of multiple speakers based on a two step acoustic map analysis," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Las Vegas, NV, USA, Apr. 2008, pp. 4349-4352
[9]

Multiple emitter location and signal parameter estimation

R. Schmidt, "Multiple emitter location and signal parameter estimation," IEEE Transactions on Antennas and Propagation, vol. 34, no. 3, pp. 276-280, Mar. 1986
[10]

A survey of sound source localization with deep learning methods,

P.-A. Grumiaux, S. Kitic, L. Girin, and A. Guerin, "A survey of sound source localization with deep learning methods," Journal of the Acoustical Society of America, vol. 152, no. 1, pp. 107-151, July 2022

[11]

A learning-based approach to direction of arrival estimation in noisy and reverberant environments

X. Xiao, S. Zhao, X. Zhong, D. L. Jones, E. S. Chng, and H. Li, "A learning-based approach to direction of arrival estimation in noisy and reverberant environments," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Brisbane, Australia, Apr. 2015, pp. 2814-2818
[12]

Direction of arrival estimation of noisy speech using convolutional recurrent neural networks with higher order ambisonics signals,

N. Poschadel, R. Rupke, S. Preihs, and J. Peissig, "Direction of arrival estimation of noisy speech using convolutional recurrent neural networks with higher order ambisonics signals," in Proceedings of the 29th European Signal Processing Conference (EUSIPCO), Dublin, Ireland, Aug. 2021, pp. 211-215

[13]

A review on sound source localization systems

D. Desai and N. Mehendale, "A review on sound source localization systems," Archives of Computational Methods in Engineering, vol. 29, no. 7, pp. 4631-4642, May 2022
[14]

Deep learning approach in doa estimation: A systematic literature review

S. Ge, K. Li, and S.N.B.M. Rum, "Deep learning approach in doa estimation: A systematic literature review," Mobile Information Systems, vol. 2021, pp. 1-14, Sept. 2021
[15]

Direction of arrival estimation for multiple sound sources using convolutional recurrent neural network

S. Adavanne, A. Politis, and T. Virtanen, "Direction of arrival estimation for multiple sound sources using convolutional recurrent neural network," in Proceedings of the 26th European Signal Processing Conference (EUSIPCO), Sept. 2018, pp. 1462-1466
[16]

Sound event localization and detection of overlapping sources using convolutional recurrent neural networks,

S. Adavanne, A. Politis, J. Nikunen, and T. Virtanen, "Sound event localization and detection of overlapping sources using convolutional recurrent neural networks," IEEE Journal of Selected Topics in Signal Processing, vol. 13, no. 1, pp. 34-48, Mar. 2019

[17]

Deep learning based multi-source localization with source splitting and its effectiveness in multi-talker speech recognition

A. S. Subramanian, C. Weng, S. Watanabe, M. Yu, and D. Yu, "Deep learning based multi-source localization with source splitting and its effectiveness in multi-talker speech recognition," Computer Speech and Language, vol. 75, pp. 1-14, Feb. 2022
[18]

Multi-microphone simultaneous speakers detection and localization of multi-sources for separation and noise reduction

A. Schwartz, O. Schwartz, S.E. Chazan, and S. Gannot, "Multi-microphone simultaneous speakers detection and localization of multi-sources for separation and noise reduction," EURASIP Journal on Audio, Speech, and Music Processing, vol. 2024, pp. 1-15, Oct. 2024
[19]

Enhancing direction-of-arrival estimation with multi-task learning

S. Bianco, L. Celona, P. Crotti, P. Napoletano, G. Petraglia, and P. Vinetti, "Enhancing direction-of-arrival estimation with multi-task learning," Sensors, vol. 24, no. 22, pp. 1-17, Nov. 2024
[20]

Robust source counting and doa estimation using spatial pseudo-spectrum and convolutional neural network

T.N.T. Nguyen, W.-S. Gan, R. Ranjan, and D.L. Jones, "Robust source counting and doa estimation using spatial pseudo-spectrum and convolutional neural network," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 28, pp. 2626-2637, Sept. 2020
