arxiv: 2509.21382 · v2 · submitted 2025-09-23 · 📡 eess.AS · cs.SD

Multi-Speaker DOA Estimation in Binaural Hearing Aids using Deep Learning and Speaker Count Fusion

Pith reviewed 2026-05-18 14:09 UTC · model grok-4.3

classification 📡 eess.AS cs.SD
keywords DOA estimation · binaural hearing aids · multi-speaker · deep learning · source counting · CRNN · late fusion · direction of arrival

The pith

Accurate source count as auxiliary input boosts multi-speaker DOA estimation in binaural hearing aids.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper examines how information about the number of active speakers can aid a deep learning model in estimating directions of arrival for multiple talkers using binaural hearing aid microphones. The authors compare joint training on both DOA and source counting against using source count as an extra feature fed into a CRNN at early, middle, or late stages. Joint training improves counting predictions but leaves DOA performance unchanged, while late fusion of perfect source count data raises average F1-scores by up to 14 percent over the baseline CRNN on real recordings. A sympathetic reader would care because reliable direction estimates let hearing aids better isolate a target voice in noisy, multi-speaker settings.

Core claim

A ground-truth source count used as an auxiliary feature significantly enhances standalone DOA estimation performance, with late fusion yielding up to 14% higher average F1-scores over the baseline CRNN.
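The F1-score behind this claim can be illustrated with a minimal sketch. The paper's exact scoring protocol (angular tolerance, averaging over frames or conditions) is not stated in this summary, so `multilabel_f1` is a hypothetical helper that assumes each frame carries a set of active DOA class indices and counts only exact class matches.

```python
def multilabel_f1(true_sets, pred_sets):
    """Micro-averaged F1 over frames; each frame is a set of DOA class indices.

    Assumption: exact class match counts as a hit (no angular tolerance).
    """
    tp = fp = fn = 0
    for true, pred in zip(true_sets, pred_sets):
        tp += len(true & pred)   # correctly detected directions
        fp += len(pred - true)   # predicted directions with no source
        fn += len(true - pred)   # active sources that were missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Three frames: one speaker, two speakers, silence (illustrative data only).
truth = [{3}, {3, 7}, set()]
preds = [{3}, {3}, {5}]
score = multilabel_f1(truth, preds)
```

In this toy case there are two true positives, one false positive, and one miss, so precision and recall are both 2/3.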

What carries the argument

Late fusion of the number of active sources (0, 1, or 2+) as an auxiliary input into the CRNN architecture for multi-source DOA estimation.
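A rough sketch of what late fusion could look like is given below. It is not the authors' implementation: the one-hot encoding of the count, the single dense output layer, the feature sizes, and all names (`count_to_class`, `late_fusion_output`) are assumptions made for illustration.

```python
import math
import random


def count_to_class(num_active):
    """Map a raw number of active sources to the three classes 0, 1, 2+."""
    return min(num_active, 2)


def one_hot(index, size=3):
    """Assumed encoding of the count class; the summary does not specify one."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def late_fusion_output(recurrent_features, num_active, weights, biases):
    """Concatenate the count one-hot with late-stage features, then apply
    one dense layer with sigmoid outputs (one score per DOA class)."""
    fused = list(recurrent_features) + one_hot(count_to_class(num_active))
    scores = []
    for row, bias in zip(weights, biases):
        z = sum(w * x for w, x in zip(row, fused)) + bias
        scores.append(1.0 / (1.0 + math.exp(-z)))
    return scores


# Hypothetical sizes and random weights, standing in for a trained network.
rng = random.Random(0)
n_features, n_doa_classes = 8, 4
features = [rng.uniform(-1, 1) for _ in range(n_features)]
weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_features + 3)]
           for _ in range(n_doa_classes)]
biases = [0.0] * n_doa_classes
doa_scores = late_fusion_output(features, num_active=3,
                                weights=weights, biases=biases)
```

Early or mid fusion would, under the same assumptions, inject the encoded count at the input or between the convolutional and recurrent stages instead of just before the output layer.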

If this is right

  • Dual-task training benefits source-count prediction but does not improve DOA estimation.
  • Late fusion of source count outperforms early and mid fusion strategies for DOA accuracy.
  • Source-count information can be used to make DOA estimation more robust in multi-speaker noisy environments.
  • Experiments on real binaural recordings confirm the gains from oracle source count.
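The dual-task setup in the first bullet can be sketched as a weighted sum of two losses. The loss types, the weight `count_weight`, and the function names below are assumptions; the summary does not say how the authors combine the two objectives.

```python
import math


def binary_cross_entropy(targets, probs, eps=1e-12):
    """Mean binary cross-entropy over multi-label DOA classes."""
    total = sum(t * math.log(p + eps) + (1 - t) * math.log(1 - p + eps)
                for t, p in zip(targets, probs))
    return -total / len(targets)


def categorical_cross_entropy(target_index, probs, eps=1e-12):
    """Cross-entropy for the three-way count class (0, 1, 2+)."""
    return -math.log(probs[target_index] + eps)


def dual_task_loss(doa_targets, doa_probs, count_target, count_probs,
                   count_weight=0.5):
    """Joint loss: DOA term plus a weighted source-count term.

    count_weight is a hypothetical hyperparameter, not taken from the paper.
    """
    return (binary_cross_entropy(doa_targets, doa_probs)
            + count_weight * categorical_cross_entropy(count_target, count_probs))


# Two active directions (classes 0 and 3), so the count class is 2+ (index 2).
loss = dual_task_loss([1, 0, 0, 1], [0.9, 0.2, 0.1, 0.7],
                      count_target=2, count_probs=[0.05, 0.15, 0.8])
```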

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Pairing the DOA model with a separate high-accuracy source counter could transfer these gains to real devices without oracle inputs.
  • The results imply that source count supplies useful context that helps the network resolve directional ambiguities among overlapping speakers.
  • Future tests could measure how much the 14% F1 gain shrinks as source-count error rate increases from zero.

Load-bearing premise

The reported gains assume perfect oracle source-count information is supplied to the DOA network.

What would settle it

Replace the oracle source count with outputs from an imperfect source counter and measure whether the F1-score gains over the baseline CRNN remain statistically significant on the same real binaural recordings.

Figures

Figures reproduced from arXiv: 2509.21382 by Farnaz Jazaeri, François Grondin, Homayoun Kamkar-Parsi, Martin Bouchard.

Figure 1: figures/full_fig_p002_1.png
original abstract

For extracting a target speaker voice, direction-of-arrival (DOA) estimation is crucial for binaural hearing aids operating in noisy, multi-speaker environments. Among the solutions developed for this task, a deep learning convolutional recurrent neural network (CRNN) model leveraging spectral phase differences and magnitude ratios between microphone signals is a popular option. In this paper, we explore adding source-count information for multi-sources DOA estimation. The use of dual-task training with joint multi-sources DOA estimation and source counting is first considered. We then consider using the source count as an auxiliary feature in a standalone DOA estimation system, where the number of active sources (0, 1, or 2+) is integrated into the CRNN architecture through early, mid, and late fusion strategies. Experiments using real binaural recordings are performed. Results show that the dual-task training does not improve DOA estimation performance, although it benefits source-count prediction. However, a ground-truth (oracle) source count used as an auxiliary feature significantly enhances standalone DOA estimation performance, with late fusion yielding up to 14% higher average F1-scores over the baseline CRNN. This highlights the potential of using source-count estimation for robust DOA estimation in binaural hearing aids.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript explores enhancing multi-speaker DOA estimation for binaural hearing aids by incorporating source-count information (0, 1, or 2+) into a CRNN model that uses spectral phase differences and magnitude ratios. It first evaluates dual-task joint training for DOA and source counting, then examines early, mid, and late fusion of source count as an auxiliary feature in a standalone DOA system. Experiments on real binaural recordings show dual-task training improves counting but not DOA, while oracle source count with late fusion yields up to 14% higher average F1-scores over the baseline CRNN.

Significance. If the central result holds, the work demonstrates that auxiliary source-count information can meaningfully boost DOA performance in multi-speaker settings relevant to hearing aids. Credit is due for the use of real binaural recordings and for reporting concrete F1-score gains. The comparison of fusion stages provides useful architectural insight. However, the practical significance for deployment remains conditional on the untested robustness to source-count estimation errors.

major comments (1)
  1. [Experiments and Results] The reported F1 improvements (up to 14% with late fusion) rely on supplying oracle/ground-truth source counts to the DOA network. No ablation or sensitivity analysis is provided that injects realistic count errors or substitutes the dual-task source-count output as input. This assumption is load-bearing for the claim that source-count fusion enhances standalone DOA estimation in binaural hearing aids, as real systems would use estimated counts whose error propagation is unquantified.
minor comments (2)
  1. [Abstract] The abstract omits dataset size, number of recordings, cross-validation procedure, and any statistical significance tests for the F1 gains; adding these details would strengthen reproducibility claims.
  2. [Methods] Notation for the fusion stages (early/mid/late) and how the source-count feature is encoded (one-hot or embedding) could be clarified with a diagram or explicit equations in the methods section.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and for acknowledging the value of our experiments on real binaural recordings and the architectural comparisons of fusion strategies. We address the major comment on the use of oracle source counts below and will incorporate additional analysis in the revised manuscript.

point-by-point responses
  1. Referee: The reported F1 improvements (up to 14% with late fusion) rely on supplying oracle/ground-truth source counts to the DOA network. No ablation or sensitivity analysis is provided that injects realistic count errors or substitutes the dual-task source-count output as input. This assumption is load-bearing for the claim that source-count fusion enhances standalone DOA estimation in binaural hearing aids, as real systems would use estimated counts whose error propagation is unquantified.

    Authors: We agree that the robustness to source-count estimation errors is an important consideration for practical deployment in hearing aids. Our use of oracle counts was deliberate to quantify the upper-bound improvement achievable when perfect count information is available, thereby isolating the contribution of the fusion mechanism itself. The dual-task joint training was already evaluated in the manuscript; it improved source-count accuracy but produced no gain in DOA F1-scores relative to the baseline CRNN. In the revised manuscript we will add a dedicated sensitivity study: we will (i) inject controlled synthetic errors into the oracle count input (e.g., 10% and 20% random misclassification rates among the three classes) and report the resulting degradation in DOA F1-scores under late fusion, and (ii) substitute the actual output of the dual-task count predictor as the auxiliary input to the DOA network. These results will directly quantify error propagation and strengthen the practical relevance of the findings.
    revision: yes
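The synthetic error injection proposed in this response could be prototyped along the following lines. This is a sketch under assumptions: `corrupt_counts` is a hypothetical helper, errors are drawn uniformly over the two wrong classes, and the count sequence is made-up illustrative data.

```python
import random


def corrupt_counts(oracle_counts, error_rate, seed=0, num_classes=3):
    """Replace each oracle count class with a wrong class with probability
    error_rate; the wrong class is drawn uniformly from the other classes."""
    rng = random.Random(seed)
    noisy = []
    for count in oracle_counts:
        if rng.random() < error_rate:
            wrong = [c for c in range(num_classes) if c != count]
            noisy.append(rng.choice(wrong))
        else:
            noisy.append(count)
    return noisy


# Illustrative per-frame count classes (0, 1, 2 meaning 2+), not real data.
oracle = [0, 1, 2, 1, 2, 0, 1, 1, 2, 2] * 100
for rate in (0.1, 0.2):
    noisy = corrupt_counts(oracle, rate)
    observed = sum(a != b for a, b in zip(oracle, noisy)) / len(oracle)
    print(f"target error rate {rate:.0%}, observed {observed:.1%}")
```

The corrupted sequence would then replace the oracle count at the fusion input, and the DOA F1-score would be recomputed at each error rate.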

Circularity Check

0 steps flagged

No circularity: empirical gains measured on held-out data

full rationale

The paper presents standard supervised CRNN training for binaural DOA estimation, with ablation experiments comparing baseline, dual-task, and oracle-source-count fusion variants. Performance is quantified via F1-scores on real held-out binaural recordings; no equations, fitted parameters, or self-citations are shown to reduce the reported 14% F1 improvement to an input by construction. The central result is an empirical observation about oracle auxiliary features, not a derivation that collapses to its own definitions or prior self-citations.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on standard assumptions of deep learning for audio (features capture spatial cues, CRNN can learn temporal dependencies) plus the availability of oracle source count; no new physical entities or ad-hoc constants are introduced.

free parameters (1)
  • fusion stage
    Early, mid, and late fusion points are tested and late is selected for best reported performance.
axioms (1)
  • domain assumption: Spectral phase differences and magnitude ratios between binaural channels are sufficient spatial features for DOA estimation
    Invoked as the input representation for the baseline CRNN.
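To make the feature assumption concrete, here is a minimal sketch of the two quantities for a single time-frequency bin of one microphone pair. The function name `interchannel_features`, the dB scaling of the magnitude ratio, and the example values are assumptions; the paper's exact feature definition is not reproduced in this summary.

```python
import cmath
import math


def interchannel_features(left_bin, right_bin, eps=1e-12):
    """Return (phase difference in radians, magnitude ratio in dB) for one
    time-frequency bin of two microphone signals given as complex STFT values."""
    # Phase of the cross-spectrum, wrapped to [-pi, pi].
    phase_diff = cmath.phase(left_bin * right_bin.conjugate())
    # eps guards against division by zero and log of zero in silent bins.
    ratio_db = 20.0 * math.log10((abs(left_bin) + eps) / (abs(right_bin) + eps))
    return phase_diff, ratio_db


# Hypothetical bin: left channel twice as loud and leading by 0.5 rad.
left = cmath.rect(2.0, 0.75)
right = cmath.rect(1.0, 0.25)
ipd, ild = interchannel_features(left, right)
```

Here `ipd` comes out at 0.5 rad and `ild` at roughly 6 dB; a full front end would compute these for every frequency bin and frame.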

Reference graph

Works this paper leans on

20 extracted references · 20 canonical work pages

  1. [1]

    Direction-of-arrival (DOA) estimation enables the device to steer beamformers, suppress noise, and enhance conversational cues

    INTRODUCTION. For hearing-aid users, accurately localizing active speakers is essential for speech intelligibility and situational awareness. Direction-of-arrival (DOA) estimation enables the device to steer beamformers, suppress noise, and enhance conversational cues. However, real-world listening environments such as restaurants and meeting rooms pose ...

  2. [2]

    METHODOLOGY. 2.1. Problem Formulation. We estimate direction-of-arrival (DOA) from a binaural 2-microphone behind-the-ear (BTE) binaural hearing-aid, using three microphone signals: front/rear microphones on the local device, and front microphone from the opposite-side device. Taking the view of the right-side device in Fig. 1(a), multi-sources DOA esti...

  3. [3]

    EXPERIMENTS. 3.1. Datasets. Training relied on synthetic mixtures generated by convolving TIMIT speech with head-related impulse responses (HRIRs) from multiple WS Audiology behind-the-ear (BTE) 2-microphone binaural hearing aid devices, measured in both anechoic and reverberant rooms (with RT60 reverberation time from 0.1 to 0.6 sec). Mixtures were produ...

  4. [4]

    CONCLUSION. This work evaluated the use of source-count information to potentially improve DOA estimation performance in binaural hearing aids. Dual-task training for both DOA estimation and source counting did not benefit DOA estimation performance, suggesting that multi-label DOA outputs already encode implicit source count information. However, e...

  5. [5]

    The generalized correlation method for estimation of time delay,

    C. H. Knapp and G. C. Carter, "The generalized correlation method for estimation of time delay," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 24, no. 4, pp. 320-327, Aug. 1976

  6. [6]

    M. S. Brandstein and D. B. Ward, Eds., Microphone Arrays: Signal Processing Techniques and Applications, Springer, Berlin, Germany, 2001

  7. [7]

    A robust method to count and locate audio sources in a multichannel underdetermined mixture,

    E. Arberet, R. Gribonval, and F. Bimbot, "A robust method to count and locate audio sources in a multichannel underdetermined mixture," IEEE Transactions on Signal Processing, vol. 58, no. 1, pp. 121-133, Jan. 2010. [4] C.R. Landschoot and N. Xiang, "Model-based bayesian direction of arrival analysis for sound sources using a spherical microphone array...

  8. [8]

    Localization of multiple speakers based on a two step acoustic map analysis,

    A. Brutti, M. Omologo, and P. Svaizer, "Localization of multiple speakers based on a two step acoustic map analysis," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Las Vegas, NV, USA, Apr. 2008, pp. 4349-4352

  9. [9]

    Multiple emitter location and signal parameter estimation,

    R. Schmidt, "Multiple emitter location and signal parameter estimation," IEEE Transactions on Antennas and Propagation, vol. 34, no. 3, pp. 276-280, Mar. 1986

  10. [10]

    A survey of sound source localization with deep learning methods,

    P.-A. Grumiaux, S. Kitic, L. Girin, and A. Guerin, "A survey of sound source localization with deep learning methods," Journal of the Acoustical Society of America, vol. 152, no. 1, pp. 107-151, July 2022

  11. [11]

    A learning-based approach to direction of arrival estimation in noisy and reverberant environments,

    X. Xiao, S. Zhao, X. Zhong, D. L. Jones, E. S. Chng, and H. Li, "A learning-based approach to direction of arrival estimation in noisy and reverberant environments," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Brisbane, Australia, Apr. 2015, pp. 2814-2818

  12. [12]

    Direction of arrival estimation of noisy speech using convolutional recurrent neural networks with higher-order ambisonics signals,

    N. Poschadel, R. Rupke, S. Preihs, and J. Peissig, "Direction of arrival estimation of noisy speech using convolutional recurrent neural networks with higher-order ambisonics signals," in Proceedings of the 29th European Signal Processing Conference (EUSIPCO), Dublin, Ireland, Aug. 2021, pp. 211-215

  13. [13]

    A review on sound source localization systems,

    D. Desai and N. Mehendale, "A review on sound source localization systems," Archives of Computational Methods in Engineering, vol. 29, no. 7, pp. 4631-4642, May 2022

  14. [14]

    Deep learning approach in doa estimation: A systematic literature review,

    S. Ge, K. Li, and S.N.B.M. Rum, "Deep learning approach in doa estimation: A systematic literature review," Mobile Information Systems, vol. 2021, pp. 1-14, Sept. 2021

  15. [15]

    Direction of arrival estimation for multiple sound sources using convolutional recurrent neural network,

    S. Adavanne, A. Politis, and T. Virtanen, "Direction of arrival estimation for multiple sound sources using convolutional recurrent neural network," in Proceedings of the 26th European Signal Processing Conference (EUSIPCO), Sept. 2018, pp. 1462-1466

  16. [16]

    Sound event localization and detection of overlapping sources using convolutional recurrent neural networks,

    S. Adavanne, A. Politis, J. Nikunen, and T. Virtanen, "Sound event localization and detection of overlapping sources using convolutional recurrent neural networks," IEEE Journal of Selected Topics in Signal Processing, vol. 13, no. 1, pp. 34-48, Mar. 2019

  17. [17]

    Deep learning based multi-source localization with source splitting and its effectiveness in multi-talker speech recognition,

    A. S. Subramanian, C. Weng, S. Watanabe, M. Yu, and D. Yu, "Deep learning based multi-source localization with source splitting and its effectiveness in multi-talker speech recognition," Computer Speech and Language, vol. 75, pp. 1-14, Feb. 2022

  18. [18]

    Multi-microphone simultaneous speakers detection and localization of multi-sources for separation and noise reduction,

    A. Schwartz, O. Schwartz, S.E. Chazan, and S. Gannot, "Multi-microphone simultaneous speakers detection and localization of multi-sources for separation and noise reduction," EURASIP Journal on Audio, Speech, and Music Processing, vol. 2024, pp. 1-15, Oct. 2024

  19. [19]

    Enhancing direction-of-arrival estimation with multi-task learning,

    S. Bianco, L. Celona, P. Crotti, P. Napoletano, G. Petraglia, and P. Vinetti, "Enhancing direction-of-arrival estimation with multi-task learning," Sensors, vol. 24, no. 22, pp. 1-17, Nov. 2024

  20. [20]

    Robust source counting and doa estimation using spatial pseudo-spectrum and convolutional neural network,

    T.N.T. Nguyen, W.-S. Gan, R. Ranjan, and D.L. Jones, "Robust source counting and doa estimation using spatial pseudo-spectrum and convolutional neural network," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 28, pp. 2626-2637, Sept. 2020