arxiv: 2507.12002 · v2 · submitted 2025-07-16 · 💻 cs.LG

Detecting In-Person Conversations in Noisy Real-World Environments with Smartwatch Audio and Motion Sensing

Pith reviewed 2026-05-19 04:59 UTC · model grok-4.3

classification 💻 cs.LG
keywords smartwatch sensing · conversation detection · multimodal fusion · inertial signals · audio processing · social interaction · neural networks

The pith

Smartwatch audio combined with wrist motion data detects face-to-face conversations even in noisy environments.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a method to spot in-person verbal conversations using only a commodity smartwatch. It synchronizes microphone audio with 6-axis inertial signals from the accelerometer and gyroscope, then trains convolutional and attention-based neural networks on three fusion approaches. The key improvement comes from motion data capturing non-verbal cues like gestures and body shifts that audio alone misses in noisy settings. Evaluation on a lab study with 11 people and a semi-naturalistic study with 24 people shows the fused model reaching 82 percent macro F1 in controlled conditions and 77 percent in more realistic ones. The same trained model runs in real time on a commercial smartwatch app.

Core claim

Fusing inertial data with audio significantly improves detection performance by capturing non-verbal conversational dynamics. The framework achieves 82.0±3.0% macro F1-score when detecting conversations in the lab and 77.2±1.8% in the semi-naturalistic setting, and it supports real-time deployment on commercial smartwatches.
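
As a reading aid: macro F1 is the unweighted mean of per-class F1 scores, so every class counts equally no matter how many windows it contains. The sketch below computes it in plain Python. The class names and toy labels are placeholders invented for illustration; they are not the paper's classes or data.

```python
def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Placeholder class names and labels (assumptions, not from the paper).
LABELS = ["class_a", "class_b", "class_c"]
y_true = ["class_a", "class_a", "class_b", "class_b", "class_c", "class_c"]
y_pred = ["class_a", "class_b", "class_b", "class_b", "class_c", "class_a"]

print(f"macro F1 = {macro_f1(y_true, y_pred, LABELS):.3f}")
```

Because the mean is unweighted, a weak score on a rare class lowers macro F1 as much as a weak score on a common one.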

What carries the argument

Multimodal fusion of microphone audio with 6-axis inertial signals inside convolutional and attention-based neural networks
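
To make the idea of fusion concrete, here is a minimal sketch of two generic patterns: feature-level fusion (concatenating per-modality embeddings before a shared classifier) and decision-level fusion (averaging per-modality class probabilities). This page does not spell out the paper's three fusion methods, so the function names, the equal weighting, and the embedding sizes below are assumptions for illustration only.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def concat_fusion(audio_emb, imu_emb):
    """Feature-level fusion: join both embeddings into one vector.

    A shared classifier (not shown) would consume the joined vector.
    """
    return list(audio_emb) + list(imu_emb)


def late_fusion(audio_logits, imu_logits, audio_weight=0.5):
    """Decision-level fusion: weighted average of class probabilities."""
    p_audio = softmax(audio_logits)
    p_imu = softmax(imu_logits)
    return [audio_weight * a + (1.0 - audio_weight) * b
            for a, b in zip(p_audio, p_imu)]


# Hypothetical values: audio favors class 0, motion favors class 2.
fused = late_fusion([2.0, 0.0, 0.0], [0.0, 0.0, 2.0])
joined = concat_fusion([0.1, 0.2], [0.3, 0.4, 0.5])
print(fused, len(joined))
```

Feature-level fusion lets a classifier learn cross-modal interactions, while decision-level fusion keeps each modality's model independent and is easier to degrade gracefully when one sensor is unreliable.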

If this is right

  • Audio-only detection misses important conversational cues that wrist motion provides.
  • The same fusion approach works across both controlled lab and semi-naturalistic environments.
  • Real-time conversation detection becomes feasible on existing commercial smartwatches without extra hardware.
  • Social interaction tracking can rely on passive wrist-worn sensors instead of cameras or dedicated microphones.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This sensing method could support longitudinal studies of daily social contact patterns in health or workplace research.
  • Privacy-conscious users might prefer wrist-based detection over always-on phone microphones for logging conversations.
  • The approach opens a path to context-aware features such as automatic note-taking during meetings or reminders based on detected social time.

Load-bearing premise

The semi-naturalistic study with 24 participants captures enough real-world noise and variability for the detection performance to hold outside the study conditions.

What would settle it

Running the trained model on audio and motion data collected from a new group of participants in a crowded, unstructured public space where background noise and movement patterns differ markedly, and finding macro F1 below 65 percent.

Figures

Figures reproduced from arXiv: 2507.12002 by Alice Zhang, Callihan Bertley, Dawei Liang, Edison Thomaz.

Figure 1. Participants from different sessions performing group activities performed in the …
Figure 2. Example raw acoustic and inertial data of one participant across all three classes.
Figure 3. Example inertial energies and acoustic spectrograms of one participant across all three classes. The …
Figure 4. Architecture of the audio classifier. 5.3.1 Neural Network Models. The neural networks were inspired by and built upon the following models: (1) Shallow Convolutional Neural Network with Batch Normalization (SCNNB) [26] and (2) Attend&Discriminate [1]. The architectures of both models are illustrated in …
Figure 5. Architectures of motion models explored in this work. Both models take as input IMU energy distribu…
Figure 6. An overview of fusion strategies explored in this work. Audio and inertial embeddings are first extracted …
Figure 7. Confusion matrices for inertial-only (left), audio-only (center), and audio-inertial frameworks (right).
Figure 8. A comparison of LOGO evaluation results (macro F1-score) across audio, inertial, and multimodal …
Figure 10. Figure 10 shows that multimodal performance improves as window length increases, with peak performance at a window length of 30 seconds. This is in line with previous works that have found a 30 second window to provide the best tradeoff between classification accuracy and robustness [4, 12, 29, 44]. Although longer windows give better model performance, it reduces the granularity of conversation detection as each w…
Figure 11. Macro (left) and weighted (right) F1-scores across conversation …
Figure 12. Confusion matrices from inference of the pre-trained audio model on our collected dataset. Left: …
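
The Figure 10 excerpt describes classifying fixed-length windows, with 30 seconds performing best. A minimal sketch of how a sensor stream could be cut into such windows follows. The 50 Hz sampling rate, the hop length, and the function name are assumptions for illustration; the paper's actual segmentation parameters are not given on this page.

```python
def window_bounds(n_samples, sample_rate_hz, window_s=30.0, hop_s=30.0):
    """Return (start, end) sample indices of full windows over a stream.

    Trailing samples that do not fill a complete window are dropped.
    """
    win = int(round(window_s * sample_rate_hz))
    hop = int(round(hop_s * sample_rate_hz))
    return [(s, s + win) for s in range(0, n_samples - win + 1, hop)]


# Hypothetical stream: 95 seconds of inertial data at an assumed 50 Hz.
bounds = window_bounds(n_samples=95 * 50, sample_rate_hz=50)
print(bounds)

# Overlapping windows (hop shorter than window) give finer time resolution.
overlapping = window_bounds(95 * 50, 50, window_s=30.0, hop_s=15.0)
print(len(overlapping))
```

A shorter hop is one way to recover some of the temporal granularity that longer windows give up, at the cost of more inference calls per minute.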
Original abstract

Social interactions play a crucial role in shaping human behavior, relationships, and societies. It encompasses various forms of communication, such as verbal conversation, non-verbal gestures, facial expressions, and body language. In this work, we develop a novel computational approach to detect face-to-face verbal conversations, a foundational aspect of human social interactions. We leverage multimodal data captured by a commodity smartwatch, specifically synchronizing microphone audio with 6-axis inertial signals (accelerometer and gyroscope). We design, train, and evaluate convolutional and attention-based neural networks using three different fusion methods to integrate the audio and motion modalities. To validate this framework, we conduct a lab study with 11 participants and a semi-naturalistic study with 24 participants. Our comprehensive evaluation demonstrates that fusing inertial data with audio significantly improves detection performance by capturing non-verbal conversational dynamics. Overall, our framework achieved 82.0±3.0% macro F1-score when detecting conversations in the lab and 77.2±1.8% in the semi-naturalistic setting. Lastly, we demonstrate real-time conversation detection by deploying our trained model to a user application running on a commercial smartwatch.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper develops a multimodal framework for detecting face-to-face verbal conversations using synchronized audio and 6-axis inertial (accelerometer + gyroscope) data from a commodity smartwatch. It trains convolutional and attention-based neural networks with three fusion strategies, evaluates them on a lab study (N=11) and a semi-naturalistic study (N=24), reports macro F1 scores of 82.0±3.0% and 77.2±1.8%, and demonstrates real-time on-device deployment.

Significance. If the reported fusion gains and generalization hold, the work would provide a practical, privacy-preserving method for capturing social interaction dynamics via wearables, with potential value for behavioral science and health applications. The combination of audio with motion to capture non-verbal cues and the successful commercial-hardware deployment are concrete strengths.

major comments (2)
  1. [Evaluation sections (lab and semi-naturalistic studies)] The central claim that inertial fusion enables reliable detection 'in noisy real-world environments' rests on the semi-naturalistic protocol being representative of background noise, motion patterns, conversation durations, and device placement variability. With only 24 participants and no reported cross-environment, cross-device, or leave-one-environment-out testing, systematic differences between the study setting and true field conditions could inflate the 77.2±1.8% F1 without contradicting the internal numbers.
  2. [Methods and results sections] Full training details, data-split procedure, and hyperparameter search are not fully specified, making it difficult to reproduce the reported fusion benefit or to determine whether the ±3.0% and ±1.8% intervals reflect multiple random seeds, cross-validation folds, or participant-level variability.
minor comments (2)
  1. [Model architecture] Clarify the exact definitions and implementation differences among the three fusion methods (early, late, hybrid) in the model architecture description.
  2. [Discussion or conclusion] Add a limitations paragraph explicitly discussing the gap between the semi-naturalistic protocol and uncontrolled real-world conditions.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback on our manuscript. We have addressed the concerns about the representativeness of the semi-naturalistic evaluation and the lack of methodological detail by expanding the relevant sections and adding a limitations discussion. Below we respond point by point.

Point-by-point responses
  1. Referee: [Evaluation sections (lab and semi-naturalistic studies)] The central claim that inertial fusion enables reliable detection 'in noisy real-world environments' rests on the semi-naturalistic protocol being representative of background noise, motion patterns, conversation durations, and device placement variability. With only 24 participants and no reported cross-environment, cross-device, or leave-one-environment-out testing, systematic differences between the study setting and true field conditions could inflate the 77.2±1.8% F1 without contradicting the internal numbers.

    Authors: We acknowledge that a single semi-naturalistic environment cannot fully capture all possible real-world variabilities. The study was conducted in an actual café during normal operating hours, with measured background noise levels averaging 62–68 dB, natural participant motion (walking, gesturing, eating), variable conversation lengths (2–12 min), and smartwatch placement on the non-dominant wrist. All 24 participants were distinct from the lab cohort. While we did not conduct leave-one-environment-out or cross-device experiments, the fusion strategy still yields a consistent 4–6 point macro-F1 gain over audio-only baselines in this setting. In the revision we have added a detailed protocol description, noise measurements, and an explicit limitations paragraph noting the need for future multi-site validation. We believe these changes better contextualize the 77.2% result without overstating generalizability. revision: partial

  2. Referee: [Methods and results sections] Full training details, data-split procedure, and hyperparameter search are not fully specified, making it difficult to reproduce the reported fusion benefit or to determine whether the ±3.0% and ±1.8% intervals reflect multiple random seeds, cross-validation folds, or participant-level variability.

    Authors: We agree that the original submission omitted key reproducibility information. The revised Methods section now specifies: (i) the exact hyperparameter search ranges and final values (learning rate 1e-4, batch size 32, 50 epochs, early stopping patience 5), (ii) the participant-independent split (70% train / 15% val / 15% test, ensuring no user overlap), and (iii) that the reported ±3.0% and ±1.8% are standard deviations over five independent training runs with different random seeds. These additions allow readers to reproduce both the fusion gains and the reported variability. revision: yes
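
The participant-independent split and the seed-level spread described in this simulated response can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the rounding rule, the use of the sample standard deviation, and the five scores are all assumptions, and the scores are placeholders rather than results from the paper.

```python
import random
import statistics


def split_participants(participant_ids, train=0.70, val=0.15, seed=0):
    """Split at the participant level so no person appears in two subsets."""
    ids = sorted(set(participant_ids))
    random.Random(seed).shuffle(ids)
    n_train = round(train * len(ids))
    n_val = round(val * len(ids))
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]


def summarize(scores):
    """Mean and sample standard deviation across independent runs."""
    return statistics.mean(scores), statistics.stdev(scores)


# 24 hypothetical participant IDs, matching the semi-naturalistic cohort size.
train_ids, val_ids, test_ids = split_participants(range(24))
print(len(train_ids), len(val_ids), len(test_ids))

# Placeholder macro F1 values for five seeds (invented, not from the paper).
seed_scores = [0.75, 0.77, 0.79, 0.76, 0.78]
mean, std = summarize(seed_scores)
print(f"{mean:.3f} ± {std:.3f}")
```

Splitting by participant rather than by window matters because windows from the same person are highly correlated, so a window-level split would leak identity cues into the test set.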

Circularity Check

0 steps flagged

No circularity: results from empirical held-out evaluation on collected participant data

Full rationale

The paper describes data collection from a lab study (11 participants) and semi-naturalistic study (24 participants), followed by training convolutional and attention-based networks using three fusion methods on synchronized audio and 6-axis inertial signals. Reported metrics (82.0±3.0% macro F1 in lab, 77.2±1.8% in semi-naturalistic) are measured performance on held-out data splits, with real-time deployment on commercial hardware. No equations, parameters, or claims reduce by construction to self-definitions, fitted inputs renamed as predictions, or self-citation chains; the evaluation is externally falsifiable via replication on new participants or environments.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim depends on standard neural network training assumptions and the representativeness of the two user studies for real-world generalization; no major invented entities or ad-hoc parameters beyond typical model training choices.

free parameters (1)
  • neural network hyperparameters
    Learning rate, batch size, and architecture choices for the convolutional and attention models are selected during training to achieve the reported performance.
axioms (1)
  • domain assumption: Data collected in the lab study with 11 participants and semi-naturalistic study with 24 participants is representative of noisy real-world conversational environments
    Performance claims and generalization to real-world use rest on this assumption about study conditions.

pith-pipeline@v0.9.0 · 5747 in / 1333 out tokens · 48431 ms · 2026-05-19T04:59:12.470979+00:00 · methodology

Reference graph

Works this paper leans on

67 extracted references · 67 canonical work pages · 1 internal anchor

  1. [1]

    Alireza Abedin, Mahsa Ehsanpour, Qinfeng Shi, Hamid Rezatofighi, and Damith C. Ranasinghe. 2021. Attend and Discriminate: Beyond the State-of-the-Art for Human Activity Recognition Using Wearable Sensors. Proc. ACM Interact. Mob. Wearable Ubiquitous Technol. 5, 1, Article 1 (mar 2021), 22 pages. https://doi.org/10.1145/3448083

  2. [2]

    Apple. 2023. AirPods redefine the personal audio experience. https://www.apple.com/newsroom/2023/06/airpods-redefine-the-personal-audio-experience/ Accessed: 2023-12-19

  3. [3]

    Rummana Bari, Roy J. Adams, Md. Mahbubur Rahman, Megan Battles Parsons, Eugene H. Buder, and Santosh Kumar

  4. [4]

    rConverse: Moment by Moment Conversation Detection Using a Mobile Respiration Sensor. Proc. ACM Interact. Mob. Wearable Ubiquitous Technol. 2, 1, Article 2 (mar 2018), 27 pages. https://doi.org/10.1145/3191734

  5. [5]

    Rummana Bari, Md. Mahbubur Rahman, Nazir Saleheen, Megan Battles Parsons, Eugene H. Buder, and Santosh Kumar

  6. [6]

    Automated Detection of Stressful Conversations Using Wearable Physiological and Inertial Sensors. 4, 4, Article 117 (dec 2020), 23 pages. https://doi.org/10.1145/3432210

  7. [7]

    Janet Bavelas and Jennifer Gerwing. 2007. Conversational hand gestures and facial displays in face-to-face dialogue. 283–308

  8. [8]

    Vincent Becker, Linus Fessler, and Gábor Sörös. 2019. GestEar: combining audio and motion sensing for gesture recognition on smartwatches. In 2019 ISWC (London, United Kingdom) (ISWC ’19). ACM, New York, NY, USA, 10–19. https://doi.org/10.1145/3341163.3347735

  9. [9]

    Sarnab Bhattacharya, Rebecca Adaimi, and Edison Thomaz. 2022. Leveraging Sound and Wrist Motion to Detect Activities of Daily Living with Commodity Smartwatches. Proc. ACM Interact. Mob. Wearable Ubiquitous Technol. 6, 2, Article 42 (jul 2022), 28 pages. https://doi.org/10.1145/3534582

  10. [10]

    Hervé Bredin. 2023. pyannote.audio 2.1 speaker diarization pipeline: principle, benchmark, and recipe. In Proc. INTERSPEECH 2023

  11. [11]

    Esra Nur Çatak, Alper Açık, and Tilbe Göksun. 2018. The relationship between handedness and valence: A gesture study. Quarterly Journal of Experimental Psychology 71, 12 (2018), 2615–2626. https://doi.org/10.1177/1747021817750110 PMID: 29355469

  12. [12]

    Ciro Cattuto, Wouter Van den Broeck, Alain Barrat, Vittoria Colizza, Jean-François Pinton, and Alessandro Vespignani

  13. [13]

    Dynamics of Person-to-Person Interactions from Distributed RFID Sensor Networks. PLOS ONE 5, 7 (07 2010), 1–9. https://doi.org/10.1371/journal.pone.0011596

  14. [14]

    Metod Celestina, Jan Hrovat, and Chucri A. Kardous. 2018. Smartphone-based sound level measurement apps: Evaluation of compliance with international sound level meter standards. Applied Acoustics 139 (2018), 119–128. https://doi.org/10.1016/j.apacoust.2018.04.011

  15. [15]

    Tanzeem Choudhury and Alex Pentland. 2003. Sensing and Modeling Human Networks. (01 2003)

  16. [16]

    International Noise Awareness Day. 2024. https://noiseawareness.org/info-center/common-noise-levels/ Accessed: 2024-09-05

  17. [17]

    Susan Kay Donaldson. 1979. One Kind of Speech Act: How Do We Know When We're Conversing? Semiotica 28, 3-4 (1979). https://doi.org/10.1515/semi.1979.28.3-4.259

  18. [18]

    R.S. Feldman and J.M. Tyler. 2006. Factoring in age: Nonverbal communication across the life span. 181–200. https://doi.org/10.4135/9781412976152.n10

  19. [19]

    Georgi Gerganov. 2023. whisper.cpp. https://github.com/ggerganov/whisper.cpp

  20. [20]

    John J. Godfrey, Edward C. Holliman, and Jane McDaniel. 1992. SWITCHBOARD: telephone speech corpus for research and development. In 1992 ICASSP (San Francisco, California) (ICASSP’92). IEEE Computer Society, USA, 517–520

  21. [21]

    Ki-Won Haan, Christoph Riedl, and Anita Woolley. 2021. Discovering Where We Excel: How Inclusive Turn-Taking in Conversation Improves Team Performance. In 2021 ICMI (Montreal, QC, Canada) (ICMI ’21 Companion). ACM, New York, NY, USA, 278–283. https://doi.org/10.1145/3461615.3485417

  22. [22]

    Jay Hall and W. H. Watson. 1970. The Effects of a Normative Intervention on Group Decision-Making Performance. Human Relations 23, 4 (1970), 299–317. https://doi.org/10.1177/001872677002300404

  23. [23]

    Mattias Heldner and Jens Edlund. 2010. Pauses, gaps and overlaps in conversations. Journal of Phonetics 38 (10 2010), 555–568. https://doi.org/10.1016/j.wocn.2010.08.002

  24. [24]

    Katerina Hilari and Sarah Northcott. 2016. “Struggling to stay connected”: comparing the social relationships of healthy older people and people with stroke and aphasia. Aphasiology (08 2016), 1–14. https://doi.org/10.1080/02687038.2016.1218436

  25. [25]

    Judith Holler, Kobin H. Kendrick, Marisa Casillas, and Stephen C. Levinson (Eds.). 2016. Turn-Taking in Human Communicative Interaction. Frontiers Media S.A., Lausanne. https://eprints.whiterose.ac.uk/116190/

  26. [26]

    Chiao-Yin Hsiao, Wan-Rong Jih, and Jane Hsu. 2012. Recognizing Continuous Social Engagement Level in Dyadic Conversation by Using Turn-taking and Speech Emotion Patterns

  27. [27]

    Hyunchoong Kim, Jonghoon Shin, Soohwan Kim, Yohan Ko, Kyoungwoo Lee, Hojung Cha, Seong Il Hahm, and Taejun Kwon. 2017. Collaborative classification for daily activity recognition with a smartwatch (2016 IEEE International Conference on Systems, Man, and Cybernetics). Institute of Electrical and Electronics Engineers Inc., 3707–3712. https://doi.org/10.1...

  28. [28]

    Youngki Lee, Chulhong Min, Chanyou Hwang, Jaeung Lee, Inseok Hwang, Younghyun Ju, Chungkuk Yoo, Miri Moon, Uichin Lee, and Junehwa Song. 2013. SocioPhone: Everyday face-to-face interaction monitoring platform using multi-phone sensor fusion. MobiSys 2013, 375–388. https://doi.org/10.1145/2462456.2465426

  29. [29]

    Fangyuan Lei, Xun Liu, Qingyun Dai, and Bingo Ling. 2020. Shallow convolutional neural network for image classification. SN Applied Sciences 2 (01 2020). https://doi.org/10.1007/s42452-019-1903-4

  30. [30]

    Qiang Li, Shanshan Chen, and John Stankovic. 2013. Multi-modal in-person interaction monitoring using smartphone and on-body sensors. 2013 IEEE International Conference on Body Sensor Networks, BSN 2013, 1–6. https://doi.org/10.1109/BSN.2013.6575509

  31. [31]

    Dawei Liang, Guihong Li, Rebecca Adaimi, Radu Marculescu, and Edison Thomaz. 2022. AudioIMU: Enhancing Inertial Sensing-Based Activity Recognition with Acoustic Models. In 2022 ISWC (Cambridge, United Kingdom) (ISWC ’22). ACM, New York, NY, USA, 44–48. https://doi.org/10.1145/3544794.3558471

  32. [32]

    Dawei Liang, Alice Zhang, and Edison Thomaz. 2023. Automated Face-To-Face Conversation Detection on a Commodity Smartwatch with Acoustic Sensing. Proc. ACM Interact. Mob. Wearable Ubiquitous Technol. 7, 3, Article 109 (Sept 2023), 29 pages. https://doi.org/10.1145/3610882

  33. [33]

    Chengwen Luo and Mun Chan. 2013. SocialWeaver: Collaborative inference of human conversation networks using smartphones. (11 2013), 14. https://doi.org/10.1145/2517351.2517353

  34. [34]

    Aleksandar Matic, Venet Osmani, and Oscar Mayora. 2013. Automatic Sensing of Speech Activity and Correlation with Mood Changes. 195–205. https://doi.org/10.1007/978-3-642-32538-0_9

  35. [35]

    David Matsumoto and Hyisung Hwang. 2016. The cultural bases of nonverbal communication. 77–101. https://doi.org/10.1037/14669-004

  36. [36]

    Albert Mehrabian. 1972. Nonverbal Communication. Routledge

  37. [37]

    Emiliano Miluzzo, Cory T. Cornelius, Ashwin Ramaswamy, Tanzeem Choudhury, Zhigang Liu, and Andrew T. Campbell

  38. [38]

    Darwin Phones: The Evolution of Sensing and Inference on Mobile Phones. In Proceedings of the 8th International Conference on Mobile Systems, Applications, and Services (San Francisco, California, USA) (MobiSys ’10). ACM, New York, NY, USA, 5–20. https://doi.org/10.1145/1814433.1814437

  39. [39]

    Vimal Mollyn, Karan Ahuja, Dhruv Verma, Chris Harrison, and Mayank Goel. 2022. SAMoSA: Sensing Activities with Motion and Subsampled Audio. Proc. ACM Interact. Mob. Wearable Ubiquitous Technol. 6, 3, Article 132 (sep 2022), 19 pages. https://doi.org/10.1145/3550284

  40. [40]

    Debbie S Moskowitz and Simon N Young. 2006. Ecological momentary assessment: what it is and why it is a method of the future in clinical psychopharmacology. J. Psychiatry Neurosci. 31, 1 (Jan. 2006), 13–20

  41. [41]

    Amrutha Nadarajan, Krishna Somandepalli, and Shrikanth S. Narayanan. 2019. Speaker Agnostic Foreground Speech Detection from Audio Recordings in Workplace Settings from Wearable Recorders. 2019 ICASSP (2019), 6765–6769. https://api.semanticscholar.org/CorpusID:145905315

  42. [42]

    Daniel Olguin Olguin, Benjamin N. Waber, Taemie Kim, Akshay Mohan, Koji Ara, and Alex Pentland. 2009. Sensible Organizations: Technology and Methodology for Automatically Measuring Organizational Behavior. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 39, 1 (2009), 43–55. https://doi.org/10.1109/TSMCB.2008.2006638

  43. [43]

    World Health Organization. 2021. Social isolation and loneliness among older people: advocacy brief. https://www.decadeofhealthyageing.org/find-knowledge/resources/publications/un-decade-of-healthy-ageing-advocacy-brief-social-isolation-and-loneliness-among-older-people Accessed: 2023-12-19

  44. [44]

    Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. 2015. Librispeech: An ASR corpus based on public domain audio books. In 2015 ICASSP. 5206–5210. https://doi.org/10.1109/ICASSP.2015.7178964

  45. [45]

    Matthew Pantell, David Rehkopf, Douglas Jutte, S Leonard Syme, John Balmes, and Nancy Adler. 2013. Social isolation: a predictor of mortality comparable to traditional clinical risk factors. Am. J. Public Health 103, 11 (Nov. 2013), 2056–2062

  46. [46]

    Maciej Pawłowski, Anna Wróblewska, and Sylwia Sysko-Romańczuk. 2023. Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 23, 5 (2023). https://doi.org/10.3390/s23052381

  47. [47]

    Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. 2022. Robust Speech Recognition via Large-Scale Weak Supervision. https://doi.org/10.48550/ARXIV.2212.04356

  48. [48]

    Md Rahman, Amin Ali, Kurt Plarre, Mustafa al’Absi, Emre Ertin, and Santosh Kumar. 2011. MConverse: Inferring conversation episodes from respiratory measurements collected in the field. Proceedings - Wireless Health 2011, WH’11,

  49. [49]

    https://doi.org/10.1145/2077546.2077557

  50. [50]

    Mary Redmayne. 2017. Where’s Your Phone? A Survey of Where Women Aged 15-40 Carry Their Smartphone and Related Risk Perception: A Survey and Pilot Study. PLOS ONE 12 (01 2017), e0167996. https://doi.org/10.1371/journal.pone.0167996

  51. [51]

    Harry T. Reis and Ladd Wheeler. 1991. Studying Social Interaction with the Rochester Interaction Record. Elsevier, 269–318. https://doi.org/10.1016/s0065-2601(08)60332-9

  52. [52]

    Sujiya S and Dr.Chandra E. 2017. A Review on Speaker Recognition. International Journal of Engineering and Technology 9 (06 2017), 1592–1598. https://doi.org/10.21817/ijet/2017/v9i3/170903513

  53. [53]

    Harvey Sacks, Emanuel Schegloff, and Gail Jefferson. 1974. A Simple Systematic for the Organisation of Turn Taking in Conversation. Language 50 (12 1974), 696–735. https://doi.org/10.2307/412243

  54. [54]

    Florian Schiel, Christoph Draxler, Angela Baumann, Tania Ellbogen, and Alex T. Steffen. 2012. The Production of Speech Corpora. https://api.semanticscholar.org/CorpusID:61111457

  55. [55]

    Nabeel Siddiqui and Rosa Chan. 2020. Multimodal hand gesture recognition using single IMU and acoustic measurements at wrist. PLOS ONE 15 (01 2020), e0227039. https://doi.org/10.1371/journal.pone.0227039

  56. [56]

    NIOSH Hearing Loss Prevention Team. 2019. NIOSH Sound Level Meter Application (app) for iOS devices

  57. [57]

    Dipanwita Thakur and Suparna Biswas. 2021. Feature fusion using deep learning for smartphone based human activity recognition. International Journal of Information Technology 13 (06 2021). https://doi.org/10.1007/s41870-021-00719-6

  58. [58]

    Yonatan Vaizman, Katherine Ellis, Gert Lanckriet, and Nadir Weibel. 2018. ExtraSensory App: Data Collection In-the-Wild with Rich User Interface to Self-Report Behavior. In 2018 CHI (CHI ’18). ACM, New York, NY, USA, 1–12. https://doi.org/10.1145/3173574.3174128

  59. [59]

    Yonatan Vaizman, Nadir Weibel, and Gert Lanckriet. 2018. Context Recognition In-the-Wild: Unified Model for Multi-Modal Sensors and Multi-Label Classification. Proc. ACM Interact. Mob. Wearable Ubiquitous Technol. 1 (01 2018), 1–22. https://doi.org/10.1145/3161192

  60. [60]

    Martin Warren. 2006. Features of Naturalness in Conversation. https://doi.org/10.1075/pbns.152

  61. [61]

    Gary Weiss, Jeffrey Lockhart, Tony Pulickal, Paul McHugh, Isaac Ronan, and Jessica Timko. 2016. Actitracker: A Smartphone-Based Activity Recognition System for Improving Health and Well-Being. 682–688. https://doi.org/10.1109/DSAA.2016.89

  62. [62]

    Kelly White, Samuel Tate, Ros Zafonte, Shrikanth Narayanan, Matthias Mehl, Min Shin, and Amar Dhand. 2023. SocialBit: protocol for a prospective observational study to validate a wearable social sensor for stroke survivors with diverse neurological abilities. BMJ Open 13, 8. https://doi.org/10.1136/bmjopen-2023-076297

  63. [63]

    Peter Wittenburg, Hennie Brugman, Albert Russel, Alex Klassmann, and Han Sloetjes. 2006. ELAN: a Professional Framework for Multimodality Research. In 5th International Conference on Language Resources and Evaluation (LREC 2006), Nicoletta Calzolari, Khalid Choukri, Aldo Gangemi, Bente Maegaard, Joseph Mariani, Jan Odijk, and Daniel Tapias (Eds.). Europea...

  64. [64]

    Chenren Xu, Sugang Li, Gang Liu, Yanyong Zhang, Emiliano Miluzzo, Yih-Farn Chen, Jun Li, and Bernhard Firner

  65. [65]

    Crowd++: unsupervised speaker count with smartphones. In 2013 UbiComp (Zurich, Switzerland) (UbiComp ’13). ACM, New York, NY, USA, 43–52. https://doi.org/10.1145/2493432.2493435

  66. [66]

    Ruiqing Yin, Hervé Bredin, and Claude Barras. 2017. Speaker Change Detection in Broadcast TV using Bidirectional Long Short-Term Memory Networks. In 2017 INTERSPEECH. Stockholm, Sweden. https://github.com/yinruiqing/change_detection

  67. [67]

    Huanle Zhang, Wan Du, Pengfei Zhou, Mo Li, and Prasant Mohapatra. 2016. DopEnc: Acoustic-Based Encounter Profiling Using Smartphones. In 2016 MobiCom (New York City, New York) (MobiCom ’16). ACM, New York, NY, USA, 294–307. https://doi.org/10.1145/2973750.2973775