arxiv: 2607.02296 · v1 · pith:5QVSSUKJnew · submitted 2026-07-02 · 📡 eess.AS

Spatial Speech Perception Systems: A Survey of Sound Source Localization, Directional Enhancement, and Speech Recognition

Pith reviewed 2026-07-03 04:50 UTC · model grok-4.3

classification 📡 eess.AS
keywords: sound source localization, directional speech enhancement, automatic speech recognition, microphone arrays, spatial audio, robot audition, speech separation

The pith

Spatial speech perception systems integrate sound source localization, directional enhancement, and speech recognition to handle real-world noise and reverberation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This survey reviews spatial speech perception systems by examining sound source localization, directional speech enhancement, and automatic speech recognition both as separate modules and within combined pipelines. It contrasts classical signal-processing techniques with recent learning-based methods for microphone-array processing, beamforming, neural enhancement, speech separation, and recognition. The work addresses practical issues including robustness under noise and reverberation, multi-speaker conditions, real-time operation, and computational demands. Representative applications in robot audition, hearing aids, smart speakers, and teleconferencing are outlined. Open challenges and directions for low-latency, perception-aware systems are identified.

Core claim

The paper claims that microphone-array information enables robust speech understanding in complex acoustic scenes when three stages are combined: sound source localization (SSL) to locate sources, directional speech enhancement (DSE) to enhance target speech while suppressing interference, and automatic speech recognition (ASR) to interpret the result. It reviews both standalone and end-to-end pipeline approaches, covering classical and data-driven techniques.

What carries the argument

The integrated processing pipeline of sound source localization (SSL), directional speech enhancement (DSE), and automatic speech recognition (ASR) that exploits microphone-array data.
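
To make the first two stages of this pipeline concrete, here is a minimal Python sketch under stated assumptions: a two-microphone, far-field array, GCC-PHAT as a classical delay estimator for SSL, and delay-and-sum as the simplest beamformer for DSE. These are illustrative picks from the classical family the survey covers, not a method the paper proposes. The function names, the 0.1 m microphone spacing, and the 343 m/s speed of sound are assumptions introduced here, and the ASR stage is omitted (it would consume the enhanced signal).

```python
import numpy as np


def gcc_phat(sig, ref, fs, max_tau=None):
    """Estimate the delay of `sig` relative to `ref` (seconds) with GCC-PHAT.

    A positive result means `sig` arrives later than `ref`.
    """
    n = len(sig) + len(ref)  # zero-pad so circular correlation does not wrap
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    cross = SIG * np.conj(REF)
    cross /= np.abs(cross) + 1e-12  # PHAT weighting: keep phase, drop magnitude
    cc = np.fft.irfft(cross, n=n)

    max_shift = n // 2
    if max_tau is not None:
        max_shift = max(1, min(int(fs * max_tau), max_shift))
    # Reorder so that index 0 corresponds to lag -max_shift.
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = int(np.argmax(np.abs(cc))) - max_shift
    return shift / fs


def doa_from_delay(tau, mic_spacing, speed_of_sound=343.0):
    """Far-field direction of arrival (degrees from broadside) for two mics.

    Assumes a plane wave; mic_spacing is in metres (hypothetical geometry).
    """
    ratio = np.clip(speed_of_sound * tau / mic_spacing, -1.0, 1.0)
    return float(np.degrees(np.arcsin(ratio)))


def delay_and_sum(channels, delays_samples):
    """Align each channel by its integer sample delay and average them.

    np.roll wraps samples around the ends, which is acceptable for a sketch
    but would be replaced by proper padding in a streaming system.
    """
    aligned = [
        np.roll(np.asarray(ch, dtype=float), -int(d))
        for ch, d in zip(channels, delays_samples)
    ]
    return np.mean(aligned, axis=0)
```

In use, one would estimate the delay between the two channels with `gcc_phat`, convert it to an angle with `doa_from_delay`, and pass the same delay (in samples) to `delay_and_sum` to steer toward the talker. Real systems of the kind the survey describes typically use more microphones, fractional delays, and adaptive or neural beamformers, so this should be read only as the smallest working instance of the localize-then-enhance idea.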

Load-bearing premise

The papers and methods chosen for review represent the current field without major omissions or biases in coverage of classical and learning-based approaches.

What would settle it

A search that identifies multiple significant recent papers or standard methods on SSL, DSE, or integrated ASR pipelines absent from the survey would indicate the review is incomplete.

Figures

Figures reproduced from arXiv: 2607.02296 by Dimitrios Kanoulas, Pengyuan Shao.

Figure 1: Overview of a spatial speech perception pipeline, illustrating the integration of sound source localization (SSL), directional speech enhancement …
Figure 2: Evolution of automatic speech recognition (ASR) architectures.
Original abstract

Robust speech understanding in real-world acoustic environments remains a fundamental challenge for intelligent auditory systems such as robot audition, hearing aids, teleconferencing systems, smart speakers, and voice-controlled assistants. These systems must operate under background noise, reverberation, competing speakers, and dynamic acoustic conditions. Spatial speech perception addresses this challenge by exploiting microphone-array information to localize, enhance, and interpret target speech in complex acoustic scenes. This paper surveys spatial speech perception systems with emphasis on the roles of sound source localization (SSL), directional speech enhancement (DSE), and automatic speech recognition (ASR), both individually and within integrated processing pipelines. We review classical signal-processing approaches and recent learning-based methods for microphone-array localization, beamforming, neural enhancement, speech separation, and modern recognition architectures. Beyond component-level analysis, we discuss robustness to noise and reverberation, multi-speaker operation, real-time constraints, and computational efficiency. We also examine representative applications in robot audition, hearing assistance, smart speakers, and teleconferencing, and identify open challenges and future directions toward robust, low-latency, and perception-aware speech systems for complex acoustic environments.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 0 minor

Summary. The paper surveys spatial speech perception systems, emphasizing the roles of sound source localization (SSL), directional speech enhancement (DSE), and automatic speech recognition (ASR) both as individual components and within integrated pipelines. It reviews classical signal-processing approaches alongside recent learning-based methods for microphone-array localization, beamforming, neural enhancement, speech separation, and modern recognition architectures, while addressing robustness to noise and reverberation, multi-speaker scenarios, real-time constraints, computational efficiency, representative applications (robot audition, hearing aids, smart speakers, teleconferencing), and open challenges for robust low-latency systems.

Significance. If the reviewed literature is representative, the survey would provide a useful consolidation of work on spatial audio processing pipelines, helping to map connections between SSL, DSE, and ASR and to highlight directions toward perception-aware systems; its value lies in the descriptive synthesis rather than new derivations or empirical results.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive evaluation of the manuscript and for recommending acceptance. Their summary correctly reflects the paper's focus on integrating SSL, DSE, and ASR within spatial speech perception pipelines.

Circularity Check

0 steps flagged

No significant circularity: survey of external literature

This paper is a survey reviewing SSL, DSE, and ASR methods from external sources. It contains no derivations, equations, predictions, fitted parameters, or theorems whose validity depends on internal assumptions or self-citations. The central claim is descriptive (review of literature and pipelines), with no load-bearing steps that reduce to the paper's own inputs by construction. All content is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This survey paper does not introduce new parameters, axioms, or entities; its contribution is organizational review of existing methods rather than derivation or modeling.
