arxiv: 2604.23927 · v2 · submitted 2026-04-27 · 💻 cs.HC

Recognition: no theorem link

Towards Localizing Conversation Partners using Head Motion

Payal Mohapatra , Calvin Murdock , Ali Aroudi , Ishwarya Ananthabhotla , Anjali Menon , Buye Xu , Morteza Khaleghimeybodi

Authors on Pith no claims yet

Pith reviewed 2026-05-15 06:43 UTC · model grok-4.3

classification 💻 cs.HC

keywords head motionIMUsmartglassesacoustic zonesconversation partnersspeech enhancementwearable localization

0 comments

The pith

Head motion from smartglasses IMUs localizes acoustic zones of interest for conversation partners.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that head-orienting behavior captured by inertial measurement units on smartglasses can infer the directions a user wants to listen during seated conversations. This matters because existing spatial audio methods ignore user preferences and struggle when background speakers create interference. The authors introduce HALo, a network that maps IMU data to these acoustic zones and gains 21 percent better performance when given an accurate count of conversation partners in advance. They add CoCo, a classifier that determines the number of partners from the same motion signals alone at 0.74 accuracy. The work also includes a minimal speech enhancement demonstration that benefits from this localization in high-noise multi-speaker conditions.

Core claim

HALo processes IMU time series from smartglasses to predict acoustic zones of interest that correspond to conversation partner locations; supplying it with a prior estimate of partner count produces a 21 percent performance lift over baseline methods, while the companion CoCo classifier recovers that partner count from IMU data at 0.74 accuracy and a 35 percent gain over rule-based and generic time-series baselines.

What carries the argument

HALo, the head-orientation-based acoustic zone localization network that converts smartglasses IMU signals into estimates of user listening directions.

If this is right

Enables non-invasive inference of listening directions that does not depend on microphone arrays for direction-of-arrival estimation.
Delivers a 35 percent improvement in classifying the number of conversation partners over rule-based and generic time-series baselines.
Supports end-to-end speech enhancement pipelines that maintain performance when multiple background speakers are present.
Provides feature-extraction and inference steps suitable for real-time wearable deployment.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The stability observed over extended sessions suggests the approach could support continuous, day-long use without frequent recalibration.
Integration with other wearable signals might extend reliable zone prediction beyond seated, controlled conversations.
The same motion patterns could be repurposed for related attention tasks such as selective audio focus in meetings or classrooms.

Load-bearing premise

Head movements tracked by IMUs accurately reflect the directions users intend to listen toward, and an accurate prior count of conversation partners is available.

What would settle it

Controlled recordings in which users' measured head angles during conversation deviate from their self-reported listening directions, or in which supplying an incorrect partner count removes the reported accuracy gains.

Figures

Figures reproduced from arXiv: 2604.23927 by Ali Aroudi, Anjali Menon, Buye Xu, Calvin Murdock, Ishwarya Ananthabhotla, Morteza Khaleghimeybodi, Payal Mohapatra.

**Figure 1.** Figure 1: Concept: Using IMU sensors on smartglasses to extract the head-orienting patterns of the user during a conversational setting to identify the user’s acoustic zones of interest (which are indicative of the conversation partners’ locations). Potential use cases: Consider a large multiparty noisy setting (e.g., a restaurant) with interfering/background speakers. The user (shown wearing smartglasses) engages w… view at source ↗

**Figure 2.** Figure 2: Current smartglasses technologies (Left) view at source ↗

**Figure 3.** Figure 3: Illustration of a representative case a) showing drift that increases towards the end of the measurement window and view at source ↗

**Figure 4.** Figure 4: Study Motivation Illustration: (a) Layout of the conversation group, showing the focal user and four conversation view at source ↗

**Figure 5.** Figure 5: Spatial discretization of conversation partner locations. For the layout in view at source ↗

**Figure 6.** Figure 6: Overview of the HALo network used for localizing the acoustic zones of interest: Temporal learning module, followed by the fusion block for incorporating static features, and finally the imbalanced predictors for determining the focal user’s acoustic zones of interest. Task Formulation 1: Head-orientation based Acoustic Zones of Interest Localization (HALo) Given a sequence of head orientation measurements… view at source ↗

**Figure 7.** Figure 7: Illustration of the classification network to determine the number of conversation partners based on head-orientation view at source ↗

**Figure 9.** Figure 9: Distribution of the number of conversation part view at source ↗

**Figure 10.** Figure 10: Distribution of group sizes across all sessions. view at source ↗

**Figure 12.** Figure 12: Performance of the rule-based method: spatial density-based clustering of the focal user’s head orientation during their non-speaking state within a segment, demonstrated for a representative setting (same scene as shown in view at source ↗

**Figure 13.** Figure 13: Performance of conversation partners’ localization task view at source ↗

**Figure 14.** Figure 14: Illustration of the model’s temporal self-attention mechanism for predicting conversation partners’ locations based view at source ↗

**Figure 15.** Figure 15: (a, b) Sandwiched Mispredictions; (c, d) Shifted prediction for a single conversation partner; (e, f ) Shifted prediction view at source ↗

**Figure 17.** Figure 17: Performance of classifying the number of view at source ↗

**Figure 16.** Figure 16: Performance metrics for identifying the number of conversation partners. (a) and (b) use abstract audio features, view at source ↗

**Figure 19.** Figure 19: Bland–Altman analysis for the complete dataset, view at source ↗

**Figure 20.** Figure 20: Illustration of the F1-score and logit-wise accuracy for different spatial discretizations: (a) 3 spatial zones, (b) 6 spatial view at source ↗

**Figure 21.** Figure 21: Comparison of IMU-only model with original targets versus proposed enhancements—(1) abstract audio features and (2) cumulative voice-activity-based target shaping—across increasing group sizes in multiparty conversations. Impact of increasing group size with different conditioning on CoCo. We analyze the performance of CoCo with increasing group sizes, which overall exhibits a decreasing trend due to in… view at source ↗

**Figure 22.** Figure 22: Our proposed world-locked partner localiza view at source ↗

**Figure 23.** Figure 23: Comparing head-orientation (front facing view at source ↗

**Figure 24.** Figure 24: (a) Represents the RLR-Chat dataset organization where view at source ↗

read the original abstract

Many individuals struggle to understand conversation partners in noisy settings, particularly amid background speakers or due to hearing impairments. Emerging wearables like smartglasses offer a transformative opportunity to enhance speech from conversation partners. Crucial to this is identifying the direction in which the user wants to listen, which we refer to as the user's acoustic zones of interest. While current spatial audio-based methods can resolve the direction of vocal input, they are agnostic to listening preferences and have limited functionality in noisy settings with interfering speakers. To address this, behavioral cues are needed to actively infer a user's acoustic zones of interest. We explore the effectiveness of head-orienting behavior, captured by Inertial Measurement Units (IMUs) on smartglasses, as a modality for localizing these zones in seated conversations. We introduce HALo, a head-orientation-based acoustic zone localization network that leverages smartglasses' IMUs to non-invasively infer auditory zones of interest corresponding to conversation partner locations. By integrating an a priori estimate of the number of conversation partners, our approach yields a 21% performance improvement over existing methods. We complement this with CoCo, which classifies the number of conversation partners using only IMU data, achieving 0.74 accuracy and a 35% gain over rule-based and generic time-series baselines. We discuss practical considerations for feature extraction and inference and provide qualitative analyses over extended sessions. We also demonstrate a minimal end-to-end speech enhancement system, showing that head-orientation-based localization offers clear advantages in extremely noisy settings with multiple conversation partners.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Head motion from smartglasses IMUs can localize conversation partners for audio enhancement, but the 21% gain likely assumes perfect knowledge of partner count rather than CoCo's output.

read the letter

The main thing to take away is that head motion captured by IMUs on smartglasses can help identify which conversation partner the user is focusing on in noisy environments, delivering a 21% boost when the number of partners is known upfront, along with a CoCo model that guesses that number from IMU data alone at 74% accuracy. This is new in the way it combines behavioral head orientation with acoustic zone localization for wearables, moving beyond pure spatial audio methods that ignore user preferences and falter with background talkers. The authors build HALo for the localization task and show it improves on existing approaches with the partner count input. CoCo adds value by classifying the count using only the motion signals, outperforming rule-based and generic time-series methods. They also include a simple end-to-end speech enhancement demo and discuss feature extraction for practical use, plus some qualitative analysis on extended sessions. The approach is grounded in a real problem for hearing-impaired users and leverages hardware that's already in smartglasses. The soft spots are around the performance claims. The 21% improvement relies on an a priori estimate of partner count, which the paper presents separately from CoCo. Without an ablation showing results when using CoCo's 0.74-accurate predictions instead of perfect knowledge, it's hard to judge the full pipeline gain from IMU data only. The abstract lacks specifics on dataset size, number of participants, baselines, and statistical significance, making it tough to assess without the full details. The assumption that head turns reliably signal interest zones seems reasonable for seated talks but could be sensitive to other movements or settings. This paper targets researchers in human-computer interaction, wearable sensing, and audio enhancement for assistive devices. A colleague working on multimodal signals for noisy conversations would find it worth reading and potentially building on. It shows clear thinking on the problem and has enough substance to deserve a serious referee. I would recommend sending it for peer review, asking the authors to clarify the end-to-end performance with predicted counts.

Referee Report

2 major / 2 minor

Summary. The paper proposes HALo, an IMU-based neural network on smartglasses that localizes users' acoustic zones of interest during seated conversations by modeling head-orientation behavior. It reports a 21% performance gain when an a priori estimate of the number of conversation partners is supplied. A companion classifier CoCo predicts the partner count from IMU data alone at 0.74 accuracy (35% above rule-based and generic time-series baselines). The work includes qualitative analysis over long sessions and a minimal end-to-end speech-enhancement demonstration.

Significance. If the empirical claims are substantiated, the approach offers a practical, non-invasive way to steer spatial audio on commodity wearables toward conversation partners in noisy multi-speaker settings, directly addressing a common pain point for hearing-impaired users. The reliance on head motion rather than acoustic source localization is a useful complementary signal, and the provision of both localization and partner-count modules is a coherent pipeline contribution.

major comments (2)

[Abstract / Evaluation] Abstract and Evaluation: the headline 21% improvement for HALo is stated to result from 'integrating an a priori estimate of the number of conversation partners,' yet the manuscript presents CoCo (0.74 accuracy) separately. No ablation is reported that substitutes CoCo's predictions for the oracle count inside HALo, so it is impossible to determine the end-to-end gain achievable when only IMU data are available.
[Experimental Setup] Experimental Setup: the manuscript provides no information on dataset size, number of participants, recording conditions, exact baseline implementations, or the statistical tests used to support the 21% and 35% figures. These omissions prevent assessment of whether the reported gains are robust or merely artifacts of small or unrepresentative data.

minor comments (2)

[Method] Notation for acoustic zones of interest is introduced without a clear diagram relating IMU axes to angular zones; a figure would improve readability.
[Method] The description of feature extraction for CoCo and HALo could be expanded with pseudocode or explicit window lengths to aid reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback on our manuscript. We address each of the major comments below and indicate the revisions we will make to strengthen the paper.

read point-by-point responses

Referee: Abstract and Evaluation: the headline 21% improvement for HALo is stated to result from 'integrating an a priori estimate of the number of conversation partners,' yet the manuscript presents CoCo (0.74 accuracy) separately. No ablation is reported that substitutes CoCo's predictions for the oracle count inside HALo, so it is impossible to determine the end-to-end gain achievable when only IMU data are available.

Authors: We agree with this observation. The 21% improvement highlights the value of providing the partner count as input to HALo, while CoCo serves as a standalone classifier for estimating this count from IMU data. To address the lack of end-to-end evaluation, we will add an ablation study in the revised manuscript that uses CoCo's predictions as input to HALo and reports the resulting localization performance. This will clarify the practical gains when relying solely on IMU data. revision: yes
Referee: Experimental Setup: the manuscript provides no information on dataset size, number of participants, recording conditions, exact baseline implementations, or the statistical tests used to support the 21% and 35% figures. These omissions prevent assessment of whether the reported gains are robust or merely artifacts of small or unrepresentative data.

Authors: We apologize for these omissions in the description of the experimental setup. In the revised version, we will expand the Experimental Setup section to include details on the dataset size, number of participants, recording conditions (e.g., seated conversations in controlled environments), the precise implementations of the rule-based and time-series baselines, and the statistical tests (such as significance testing for the reported percentage improvements) used to validate the results. This will allow readers to better assess the robustness of our findings. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation relies on external a priori input and new IMU measurements

full rationale

The paper presents HALo as integrating an external a priori estimate of conversation partner count to obtain the reported 21% gain, while separately introducing CoCo as an IMU-based classifier achieving 0.74 accuracy. No equation or step reduces a prediction to a fitted parameter by construction, no self-citation chain bears the central claim, and no uniqueness theorem or ansatz is smuggled in. The approach is benchmarked against rule-based and time-series baselines using new sensor data, satisfying the default non-circularity expectation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Only the abstract is available, so the ledger is necessarily incomplete; the central claims rest on the unverified assumption that head motion correlates with listening intent.

axioms (1)

domain assumption Head orientation captured by IMUs on smartglasses indicates the user's acoustic zones of interest in seated conversations
This is the core premise enabling HALo and CoCo

pith-pipeline@v0.9.0 · 5603 in / 1232 out tokens · 28508 ms · 2026-05-15T06:43:23.252858+00:00 · methodology

Review history (2 revisions) →

discussion (0)

Reference graph

Works this paper leans on

114 extracted references · 114 canonical work pages · 2 internal anchors

[1]

Mustafa Al-Yassary, Kelly Billiaert, Gregory S Antonarakis, and Stavros Kiliaridis. 2021. Evaluation of head posture using an inertial measurement unit.Scientific reports11, 1 (2021), 19911

work page 2021
[2]

Xavier Alameda-Pineda, Jacopo Staiano, Ramanathan Subramanian, Ligia Batrinca, Elisa Ricci, Bruno Lepri, Oswald Lanz, and Nicu Sebe. 2015. Salsa: A novel dataset for multimodal group behavior analysis.IEEE transactions on pattern analysis and machine intelligence38, 8 (2015), 1707–1720

work page 2015
[3]

Sileye O Ba and Jean-Marc Odobez. 2008. Recognizing visual focus of attention from head pose in natural meetings.IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics)39, 1 (2008), 16–33

work page 2008
[4]

Shanmukha Srinivas Battula, Hassan Taherian, Ashutosh Pandey, Daniel Wong, Buye Xu, and DeLiang Wang. 2025. Robust Frame-level Speaker Localization in Reverberant and Noisy Environments by Exploiting Phase Difference Losses. InICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 1–5

work page 2025
[5]

Igor Bisio, Chiara Garibotto, Mehrnaz Hamedani, Fabio Lavagetto, Angelo Schenone, Andrea Sciarrone, and Muhammad Shahid. 2024. Towards Sensorized Glasses: A Smart Wearable System for Head Movement Monitoring. In2024 9th International Conference on Smart and Sustainable Technologies (SpliTech). IEEE, 1–6

work page 2024
[6]

JM Bland. 1986. Statistical methods for assessing agreement between two methods of clinical measurement.Lancet(1986)

work page 1986
[7]

W Owen Brimijoin, David McShefferty, and Michael A Akeroyd. 2010. Auditory and visual orienting responses in listeners with and without hearing-impairment.The Journal of the Acoustical Society of America127, 6 (2010), 3678–3688

work page 2010
[8]

Carlos Busso, Sergi Hernanz, Chi-Wei Chu, Soon-il Kwon, Sung Lee, Panayiotis G Georgiou, Isaac Cohen, and Shrikanth Narayanan. 2005. Smart room: Participant and speaker localization and identification. InProceedings.(ICASSP’05). IEEE International Conference on Acoustics, Speech, and Signal Processing, 2005., Vol. 2. IEEE, ii–1117

work page 2005
[9]

Laura Cabrera-Quiros, Andrew Demetriou, Ekin Gedik, Leander van der Meij, and Hayley Hung. 2018. The MatchNMingle dataset: a novel multi-sensor resource for the analysis of social interactions and group dynamics in-the-wild during free-standing conversations and speed dates. IEEE Transactions on Affective Computing12, 1 (2018), 113–130

work page 2018
[10]

Ciro Cattuto, Wouter Van den Broeck, Alain Barrat, Vittoria Colizza, Jean-François Pinton, and Alessandro Vespignani. 2010. Dynamics of person-to-person interactions from distributed RFID sensor networks.PloS one5, 7 (2010), e11596

work page 2010
[11]

Ishan Chatterjee, Maruchi Kim, Vivek Jayaram, Shyamnath Gollakota, Ira Kemelmacher, Shwetak Patel, and Steven M Seitz. 2022. ClearBuds: wireless binaural earbuds for learning-based speech enhancement. InProceedings of the 20th Annual International Conference on Mobile Systems, Applications and Services. 384–396

work page 2022
[12]

Tuochao Chen, Malek Itani, Sefik Emre Eskimez, Takuya Yoshioka, and Shyamnath Gollakota. 2024. Hearable devices with sound bubbles. Nature Electronics(2024), 1–12

work page 2024
[13]

Jong-Suk Choi, Munsang Kim, and Hyun-Don Kim. 2006. Probabilistic speaker localization in noisy environments by audio-visual integration. In2006 IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, 4704–4709

work page 2006
[14]

Jessica Dawson and Tom Foulsham. 2022. Your turn to speak? Audiovisual social attention in the lab and in the wild.Visual Cognition30, 1-2 (2022), 116–134

work page 2022
[15]

Hoang Do, Harvey F Silverman, and Ying Yu. 2007. A real-time SRP-PHAT source location implementation using stochastic region contraction (SRC) on a large-aperture microphone array. In2007 IEEE International Conference on Acoustics, Speech and Signal Processing-ICASSP’07, Vol. 1. IEEE, I–121

work page 2007
[16]

Xuan Dong and Donald S Williamson. 2019. A classification-aided framework for non-intrusive speech quality assessment. In2019 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (W ASPAA). IEEE, 100–104

work page 2019
[17]

Jacob Donley, Vladimir Tourbabin, Jung-Suk Lee, Mark Broyles, Hao Jiang, Jie Shen, Maja Pantic, Vamsi Krishna Ithapu, and Ravish Mehra. 2021. Easycom: An augmented reality dataset to support algorithms for easy communication in noisy environments.arXiv preprint arXiv:2107.04174 (2021). Towards Localizing Conversation Partners using Head Motion•25

work page arXiv 2021
[18]

Starkey Duncan. 1972. Some signals and rules for taking speaking turns in conversations.Journal of personality and social psychology23, 2 (1972), 283

work page 1972
[19]

Jakob Engel, Kiran Somasundaram, Michael Goesele, Albert Sun, Alexander Gamino, Andrew Turner, Arjang Talattof, Arnie Yuan, Bilal Souti, Brighid Meredith, et al. 2023. Project aria: A new tool for egocentric multi-modal ai research.arXiv preprint arXiv:2308.13561(2023)

work page internal anchor Pith review arXiv 2023
[20]

Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu, et al . 1996. A density-based algorithm for discovering clusters in large spatial databases with noise. Inkdd, Vol. 96. 226–231

work page 1996
[21]

Rebecca C Felsheim, Andreas Brendel, Patrick A Naylor, and Walter Kellermann. 2021. Head orientation estimation from multiple microphone arrays. In2020 28th European Signal Processing Conference (EUSIPCO). IEEE, 491–495

work page 2021
[22]

Tiantian Feng, Ju Lin, Yiteng Huang, Weipeng He, Kaustubh Kalgaonkar, Niko Moritz, Li Wan, Xin Lei, Ming Sun, and Frank Seide. 2025. Directional source separation for robust speech recognition on smart glasses. InICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 1–5

work page 2025
[23]

Andrea Ferlini, Alessandro Montanari, Cecilia Mascolo, and Robert Harle. 2019. Head motion tracking through in-ear wearables. InProceedings of the 1st International Workshop on Earable Computing. 8–13

work page 2019
[24]

Siska Fitrianie and Iulia Lefter. 2023. On Head Motion for Recognizing Aggression and Negative Affect during Speaking and Listening. In Proceedings of the 25th International Conference on Multimodal Interaction. 455–464

work page 2023
[25]

Gerald Friedland, Chuohao Yeo, and Hayley Hung. 2009. Visual speaker localization aided by acoustic models. InProceedings of the 17th ACM international conference on Multimedia. 195–202

work page 2009
[26]

Alexandra Frischen, Andrew P Bayliss, and Steven P Tipper. 2007. Gaze cueing of attention: visual attention, social cognition, and individual differences.Psychological bulletin133, 4 (2007), 694

work page 2007
[27]

David Gaddy and Dan Klein. 2020. Digital Voicing of Silent Speech. InProceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Bonnie Webber, Trevor Cohn, Yulan He, and Yang Liu (Eds.). Association for Computational Linguistics, Online, 5521–5530. https://doi.org/10.18653/v1/2020.emnlp-main.445

work page doi:10.18653/v1/2020.emnlp-main.445 2020
[28]

Stuart Gatehouse and William Noble. 2004. The speech, spatial and qualities of hearing scale (SSQ).International journal of audiology43, 2 (2004), 85–99

work page 2004
[29]

Linfei Ge, Qian Zhang, Jin Zhang, and Huangxun Chen. 2023. Ehtrack: Earphone-based head tracking via only acoustic signals.IEEE Internet of Things Journal(2023)

work page 2023
[30]

Ekin Gedik and Hayley Hung. 2018. Detecting conversing groups using social dynamics from wearable acceleration: Group size awareness. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies2, 4 (2018), 1–24

work page 2018
[31]

Gerard E Grossman, R John Leigh, Larry A Abel, Douglas J Lanska, and SE Thurston. 1988. Frequency and velocity of rotational head perturbations during locomotion.Experimental brain research70 (1988), 470–476

work page 1988
[32]

Uri Hadar. 1991. Speech-related body movement in aphasia: period analysis of upper arms and head movement.Brain and Language41, 3 (1991), 339–366

work page 1991
[33]

Peter Hausamann, Christian B Sinnott, Martin Daumer, and Paul R MacNeilage. 2021. Evaluation of the Intel RealSense T265 for tracking natural human head motion.Scientific reports11, 1 (2021), 12486

work page 2021
[34]

Maartje ME Hendrikse, Gerard Llorach, Volker Hohmann, and Giso Grimm. 2019. Movement and gaze behavior in virtual audiovisual listening environments resembling everyday life.Trends in Hearing23 (2019), 2331216519872362

work page 2019
[35]

Seongsil Heo, Calvin Murdock, Michael Proulx, and Christi Miller. 2025. Gaze-Enhanced Multimodal Turn-Taking Prediction in Triadic Conversations.arXiv preprint arXiv:2505.13688(2025)

work page arXiv 2025
[36]

Ĺuboš Hládek, Bernd Porr, Graham Naylor, Thomas Lunner, and W Owen Brimijoin. 2019. On the interaction of head and gaze control with acoustic beam width of a simulated beamformer in a two-talker scenario.Trends in Hearing23 (2019), 2331216519876795

work page 2019
[37]

Osamu Hoshuyama, Akihiko Sugiyama, and Akihiro Hirano. 2002. A robust adaptive beamformer for microphone arrays with a blocking matrix using constrained adaptive filters.IEEE Transactions on signal processing47, 10 (2002), 2677–2684

work page 2002
[38]

Gongping Huang, Jesper R Jensen, Jingdong Chen, Jacob Benesty, Mads G Christensen, Akihiko Sugiyama, Gary Elko, and Tomas Gaensler

work page
[39]

InICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Advances in microphone array processing and multichannel speech enhancement. InICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 1–5

work page 2025
[40]

Hayley Hung, Gwenn Englebienne, and Jeroen Kools. 2013. Classifying social actions with a single accelerometer. InProceedings of the 2013 ACM international joint conference on Pervasive and ubiquitous computing. 207–210

work page 2013
[41]

Apple Inc. 2016. Apple AirPods. https://www.apple.com/airpods/ Accessed: 2024-12-16

work page 2016
[42]

Yan-Bin Jia. 2019. Quaternions.Com S477 (2019), 577

work page 2019
[43]

Hao Jiang and Vamsi Krishna Ithapu. 2021. Egocentric pose estimation from human vision span. In2021 IEEE/CVF International Conference on Computer Vision (ICCV). IEEE, 10986–10994

work page 2021
[44]

Hao Jiang, Calvin Murdock, and Vamsi Krishna Ithapu. 2022. Egocentric deep multi-channel audio-visual active speaker localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 10544–10552

work page 2022
[45]

Charles Knapp and Glifford Carter. 2003. The generalized correlation method for estimation of time delay.IEEE transactions on acoustics, speech, and signal processing24, 4 (2003), 320–327. 26•Payal Mohapatra, Calvin Murdock, Ali Aroudi, Ishwarya Ananthabhotla, Anjali Menon, Buye Xu, and Morteza Khaleghimeybodi

work page 2003
[46]

Hol, and Thomas B

Manon Kok, Jeroen D. Hol, and Thomas B. Schön. 2017. Using inertial sensors for position and orientation estimation.Foundations and Trends in Signal Processing11, 1–2 (2017), 1–153. https://doi.org/10.1561/2000000094

work page doi:10.1561/2000000094 2017
[47]

Angkana Lertpoompunya, Erol J Ozmeral, Nathan C Higgins, and David A Eddins. 2024. Head-orienting behaviors during simultaneous speech detection and localization.Frontiers in Psychology15 (2024), 1425972

work page 2024
[48]

Yin Li, Alireza Fathi, and James M Rehg. 2013. Learning to predict gaze in egocentric video. InProceedings of the IEEE international conference on computer vision. 3216–3223

work page 2013
[49]

2007.Speech enhancement: theory and practice

Philipos C Loizou. 2007.Speech enhancement: theory and practice. CRC press

work page 2007
[50]

Hao Lu and W Owen Brimijoin. 2022. Sound source selection based on head movements in natural group conversation.Trends in Hearing26 (2022), 23312165221097789

work page 2022
[51]

Hao Lu, Martin F McKinney, Tao Zhang, and Andrew J Oxenham. 2021. Investigating age, hearing loss, and background noise effects on speaker-targeted head and eye movements in three-way conversations.The Journal of the Acoustical Society of America149, 3 (2021), 1889–1900

work page 2021
[52]

A Lundström, F Lundström, LML Lebret, and CFA Moorrees. 1995. Natural head position and natural head orientation: basic considerations in cephalometric analysis and research.European Journal of Orthodontics17, 2 (1995), 111–120

work page 1995
[53]

Naoya Maruyama, Yasuhiro Hiraguri, Keiji Kawai, and Mari Ueda. 2020. Assessing the ease of conversation in multi-group conversation spaces: Effect of background music volume on acoustic comfort in a café.Building Acoustics27, 2 (2020), 137–153

work page 2020
[54]

Aleksandar Matic, Venet Osmani, and Oscar Mayora-Ibarra. 2012. Analysis of social interactions through mobile phones.Mobile Networks and Applications17 (2012), 808–819

work page 2012
[55]

Akemi Matsuo, Taku Itami, and Jun Yoneyama. 2024. 360°Sound Localization Support System for Deaf and Hard-of-Hearing People Using Smartglasses Equipped with Two Microphone. In2024 IEEE/SICE International Symposium on System Integration (SII). IEEE, 295–300

work page 2024
[56]

Evelyn Z McClave. 2000. Linguistic functions of head movements in the context of speech.Journal of pragmatics32, 7 (2000), 855–878

work page 2000
[57]

Ronan McGarrigle, Sarah Knight, Lyndon Rakusen, Jason Geller, and Sven Mattys. 2021. Older adults show a more sustained pattern of effortful listening than young adults.Psychology and aging36, 4 (2021), 504

work page 2021
[58]

William H McKellin, Kimary Shahin, Murray Hodgson, Janet Jamieson, and Kathleen Pichora-Fuller. 2007. Pragmatics of conversation and communication in noisy settings.Journal of Pragmatics39, 12 (2007), 2159–2184

work page 2007
[59]

Ravish Mehra, Owen Brimijoin, Philip Robinson, and Thomas Lunner. 2020. Potential of augmented reality platforms to improve individual hearing aids and to support more ecologically valid research.Ear and hearing41 (2020), 140S–146S

work page 2020
[60]

Johannes Meyer, Adrian Frank, Thomas Schlebusch, and Enkelejda Kasneci. 2022. U-har: A convolutional approach to human activity recognition combining head and eye movements for context-aware smart glasses.Proceedings of the ACM on Human-Computer Interaction6, ETRA (2022), 1–19

work page 2022
[61]

Daniel A Mitchell and Boaz Rafaely. 2023. Study of speaker localization under dynamic and reverberant environments.arXiv preprint arXiv:2311.16927(2023)

work page arXiv 2023
[62]

Daniel Asher Mitchell, Boaz Rafaely, Anurag Kumar, and Vladimir Tourbabin. 2025. Improved direction of arrival estimations with a wearable microphone array for dynamic environments by reliability weighting.EURASIP Journal on Advances in Signal Processing2025, 1 (2025), 42

work page 2025
[63]

II MLKDD. [n. d.]. Multi-label Classification. ([n. d.])

work page
[64]

Payal Mohapatra, Ali Aroudi, Anurag Kumar, and Morteza Khaleghimeybodi. 2024. Non-verbal hands-free control for smart glasses using teeth clicks.arXiv preprint arXiv:2408.11346(2024)

work page arXiv 2024
[65]

Payal Mohapatra, Akash Pandey, Sinan Keten, Wei Chen, and Qi Zhu. 2023. Person identification with wearable sensing using missing feature encoding and multi-stage modality fusion. InICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 1–2

work page 2023
[66]

Payal Mohapatra, Akash Pandey, Xiaoyuan Zhang, and Qi Zhu. 2025. Can LLMs Understand Unvoiced Speech? Exploring EMG-to-Text Conversion with LLMs.arXiv preprint arXiv:2506.00304(2025)

work page arXiv 2025
[67]

Payal Mohapatra, Yueyuan Sui, Akash Pandey, Stephen Xia, and Qi Zhu. 2025. MAESTRO: Adaptive Sparse Attention and Robust Learning for Multimodal Dynamic Time Series.arXiv preprint arXiv:2509.25278(2025)

work page arXiv 2025
[68]

Vimal Mollyn, Riku Arakawa, Mayank Goel, Chris Harrison, and Karan Ahuja. 2023. Imuposer: Full-body pose estimation using imus in phones, watches, and earbuds. InProceedings of the 2023 CHI Conference on Human Factors in Computing Systems. 1–12

work page 2023
[69]

Sounak Mondal, Zhibo Yang, Seoyoung Ahn, Dimitris Samaras, Gregory Zelinsky, and Minh Hoai. 2023. Gazeformer: Scalable, effective and fast prediction of goal-directed human attention. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 1441–1450

work page 2023
[70]

Alastair H Moore, Jan Mark de Haan, Michael Syskind Pedersen, Patrick A Naylor, Mike Brookes, and Jesper Jensen. 2019. Personalized signal-independent beamforming for binaural hearing aids.The Journal of the Acoustical Society of America145, 5 (2019), 2971–2981

work page 2019
[71]

Louis-Philippe Morency and Trevor Darrell. 2006. Head gesture recognition in intelligent interfaces: the role of context in improving recognition. InProceedings of the 11th international conference on Intelligent user interfaces. 32–38

work page 2006
[72]

Calvin Murdock, Ishwarya Ananthabhotla, Hao Lu, and Vamsi Krishna Ithapu. 2024. Self-Motion As Supervision For Egocentric Audiovisual Localization. InICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 7835–7839

work page 2024
[73]

Gergely Nagymáté and Rita M Kiss. 2018. Application of OptiTrack motion capture systems in human movement analysis: A systematic literature review.Recent Innovations in Mechatronics5, 1 (2018), 1–9. Towards Localizing Conversation Partners using Head Motion•27

work page 2018
[74]

OptiTrack. 2024. OptiTrack - Motion Capture Systems. https://optitrack.com/

work page 2024
[75]

Akash Pandey, Payal Mohapatra, Wei Chen, Qi Zhu, and Sinan Keten. 2026. TimeSliver: Symbolic-Linear Decomposition for Explainable Time Series Classification.arXiv preprint arXiv:2601.21289(2026)

work page arXiv 2026
[76]

Karl Pearson and Francis Galton. 1895. VII. Note on regression and inheritance in the case of two parents.Proceedings of the Royal Society of London 58, 347-352 (1895), 240–242. https://doi.org/10.1098/rspl.1895.0041 arXiv:https://royalsocietypublishing.org/doi/pdf/10.1098/rspl.1895.0041

work page doi:10.1098/rspl.1895.0041
[77]

Michael Price, James Glass, and Anantha P Chandrakasan. 2017. A low-power speech recognizer and voice activity detector using deep neural networks.IEEE Journal of Solid-State Circuits53, 1 (2017), 66–75

work page 2017
[78]

Qiwu Qin and Yian Zhu. 2025. Robust Audio–Visual Speaker Localization in Noisy Aircraft Cabins for Inflight Medical Assistance.Sensors25, 18 (2025), 5827

work page 2025
[79]

Jesse Read, Bernhard Pfahringer, Geoff Holmes, and Eibe Frank. 2011. Classifier chains for multi-label classification.Machine learning85 (2011), 333–359

work page 2011
[80]

Rutger Rienks, Ronald Poppe, and Dirk Heylen. 2005. Differences in head orientation between speakers and listeners in multi-party conversations. International Journal HCS(2005)

work page 2005

Showing first 80 references.