pith. machine review for the scientific record.

arxiv: 2604.11570 · v1 · submitted 2026-04-13 · 💻 cs.HC · cs.MM

Recognition: unknown

From Multimodal Signals to Adaptive XR Experiences for De-escalation Training

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 15:38 UTC · model grok-4.3

classification 💻 cs.HC cs.MM
keywords multimodal signals · XR training · de-escalation · adaptive systems · law enforcement training · signal fusion · EEG · physiological monitoring

The pith

Five synchronized multimodal streams enable real-time adaptation in XR de-escalation training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper describes an early-stage multimodal system designed as a foundation for adaptive XR training in de-escalation scenarios for law enforcement. It integrates five parallel processing streams—verbal and prosodic speech, skeletal gestures from multi-view cameras, affective analysis combining video and EMG, EEG mental state decoding, and physiological arousal from skin conductance and heart data—all synchronized via Lab Streaming Layer for continuous, temporally aligned assessments of conscious and unconscious cues. An interpretation layer then connects these low-level signals to constructs like escalation and de-escalation, drawing on domain knowledge from police instructors and lay participants to ground the system in realistic conflict situations. The work reports preliminary results on cue extraction feasibility while emphasizing that effective fusion and feedback require design choices rather than purely technical solutions.
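As a concrete illustration of the synchronization mechanism, the sketch below shows how five Lab Streaming Layer streams could be read onto a common clock with pylsl. The stream names and the `handle` consumer are hypothetical placeholders, not identifiers from the paper; this is a minimal reading-side sketch, not the authors' implementation.

```python
# Minimal sketch (not the paper's code): reading five hypothetical LSL streams
# and mapping their timestamps onto one local clock for temporal alignment.
from pylsl import StreamInlet, resolve_byprop

STREAM_NAMES = ["Speech", "Gesture", "FaceEMG", "EEG", "Physio"]  # hypothetical names

def handle(name, timestamp, sample):
    # Placeholder consumer; downstream feature extraction would go here.
    print(f"{name} @ {timestamp:.4f}: {sample[:3]}")

inlets = {}
for name in STREAM_NAMES:
    found = resolve_byprop("name", name, timeout=5.0)  # discover the stream on the network
    if found:
        inlets[name] = StreamInlet(found[0])

for _ in range(1000):  # bounded polling loop for the sketch
    for name, inlet in inlets.items():
        sample, ts = inlet.pull_sample(timeout=0.0)  # non-blocking pull
        if sample is None:
            continue
        # time_correction() estimates the offset between the sender's clock and
        # the local LSL clock, so samples from different devices share a time base.
        handle(name, ts + inlet.time_correction(), sample)
```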

Core claim

The authors present the design and implementation of a system integrating five parallel processing streams—verbal and prosodic speech analysis, skeletal gesture recognition from multi-view RGB cameras, multimodal affective analysis combining lower-face video with upper-face facial EMG, EEG-based mental state decoding, and physiological arousal estimation from skin conductance, heart activity, and proxemic behavior—synchronized via Lab Streaming Layer to enable temporally aligned, continuous assessments of users' conscious and unconscious communication cues, with an interpretation layer informed by domain knowledge from police instructors and lay participants to link signals to interactional constructs such as escalation and de-escalation.

What carries the argument

The central mechanism is the Lab Streaming Layer synchronization of five multimodal streams combined with an interpretation layer that maps low-level signal representations to constructs such as escalation and de-escalation using domain knowledge.
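For intuition only, here is one way such an interpretation layer could be sketched: a weighted aggregation of normalized per-stream features into an escalation score that then drives scenario adaptation. The feature names, weights, and thresholds are invented placeholders standing in for the instructor-derived domain knowledge; the paper does not specify this mapping.

```python
# Hypothetical sketch of an interpretation layer: per-stream features -> escalation score.
# Weights and thresholds are placeholders for instructor-elicited domain knowledge.
from dataclasses import dataclass

@dataclass
class StreamFeatures:
    speech_loudness: float    # 0..1, verbal/prosodic stream
    gesture_agitation: float  # 0..1, multi-view skeletal stream
    negative_affect: float    # 0..1, video + facial-EMG fusion
    eeg_arousal: float        # 0..1, EEG mental-state decoding
    scr_arousal: float        # 0..1, skin conductance / heart activity

WEIGHTS = {  # hypothetical, not the paper's values
    "speech_loudness": 0.25,
    "gesture_agitation": 0.25,
    "negative_affect": 0.20,
    "eeg_arousal": 0.15,
    "scr_arousal": 0.15,
}

def escalation_score(f: StreamFeatures) -> float:
    """Weighted aggregate in [0, 1]; higher reads as a more escalated situation."""
    return sum(w * getattr(f, k) for k, w in WEIGHTS.items())

def adapt_scenario(score: float) -> str:
    # Placeholder decision rule for closing the loop back into the XR scenario.
    if score > 0.7:
        return "avatar de-escalates; trainer feedback cue triggered"
    if score > 0.4:
        return "avatar maintains current behaviour"
    return "avatar raises scenario intensity"

print(adapt_scenario(escalation_score(StreamFeatures(0.8, 0.6, 0.7, 0.5, 0.6))))
```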

If this is right

  • Multi-view sensing and multimodal fusion overcome occlusion and viewpoint challenges from head-mounted displays in gesture and emotion recognition.
  • Automated cue extraction works for verbal assessment, mental state decoding, and physiological arousal in XR training settings.
  • Fusion and feedback must be handled as design problems in human-AI XR training rather than purely technical tasks.
  • The approach supplies design resources and empirical insights for adaptive training in complex interpersonal conflict scenarios.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Real-time responses to unconscious physiological and EEG cues could let trainees practice handling stress signals that a human instructor might overlook.
  • Extending the system to new training domains would require fresh domain-specific knowledge to keep the interpretation layer accurate.
  • Mismatches between the interpretation layer and actual field outcomes could arise if the grounding from instructors does not generalize beyond the tested scenarios.

Load-bearing premise

Domain knowledge from police instructors and lay participants can reliably ground the interpretation layer that links low-level signals to escalation and de-escalation constructs in realistic scenarios.

What would settle it

A side-by-side comparison in which the system's real-time cue assessments and adaptive responses in XR scenarios are scored against independent expert ratings by police instructors for alignment on escalation levels.
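If such a comparison were run, the scoring could be as simple as agreement statistics between per-window system outputs and instructor labels; the sketch below, with invented placeholder ratings, shows one plausible way to compute it.

```python
# Sketch of the proposed validation: agreement between system escalation levels and
# independent instructor ratings on the same scenario windows. Ratings are invented.
from sklearn.metrics import cohen_kappa_score
from scipy.stats import spearmanr

system_levels     = [0, 1, 1, 2, 3, 2, 1, 0, 2, 3]  # per-window system output (ordinal 0-3)
instructor_levels = [0, 1, 2, 2, 3, 2, 1, 1, 2, 3]  # expert consensus labels

# Quadratic weighting credits near-misses on the ordinal escalation scale.
kappa = cohen_kappa_score(system_levels, instructor_levels, weights="quadratic")
rho, p = spearmanr(system_levels, instructor_levels)
print(f"weighted kappa = {kappa:.2f}, Spearman rho = {rho:.2f} (p = {p:.3f})")
```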

Figures

Figures reproduced from arXiv: 2604.11570 by Birgit Nierula, Daniel Johannes Meyer, Iryna Ignatieva, Karam Tomotaki-Dawoud, Mina Mottahedin, Sebastian Bosse, Thomas Koch.

Figure 1: Conceptual system overview of the extraction layer integrated into an adaptive XR experience. The system objectively assesses user experiences during virtual conflicts by extracting features from signal modalities and fusing them with avatar behaviour and storybook context. The resulting signal adapts the virtual experience. We propose an integrated pipeline that extracts communication cues from multimoda… view at source ↗
Figure 2: Integrated dual-stream pipeline combining gesture and emotion recognition for VR training feedback. The emotion stream (top) fuses lower-face video with upper-face EMG under HMD occlusion; the gesture stream (bottom) processes multi-view skeletal data to detect conflict-relevant body language. Both streams converge into a unified weighted output that delivers real-time, multimodal feedback for skill refine… view at source ↗
Figure 3: Mean values of length and loudness over answers of N=20 participants with an indicated linear regression. Our investigations into automated recognition of communication cues for conflict-related training scenarios yielded promising results across six complementary domains: For verbal communication, pilot test data was collected from N=20 police officers interacting in a prototypical VR scene for de-… view at source ↗
Figure 4: Activation patterns in 3 participants related to emotional arousal, adapted from [66]. For bodily arousal, physiological markers of emotional arousal were tested in an experimental scenario where the avatar did or did not respect the user’s personal space and showed a neutral or angry emotional facial expression [60]. Multimodal assessment revealed that our analysis methods were sufficiently sensitive to de… view at source ↗
Figure 5: Physiological results adapted from [60]. (A) Discomfort ratings to an avatar that respected the user’s personal space (light green) or violated it (dark green) with neutral or angry facial expression. (B) Effect on SCR amplitudes. There was only a main effect of Personal Space on SCR amplitudes. (C) Interaction effect of the avatar phase (whether approaching or standing in front of the user) on SCR amplitu… view at source ↗
read the original abstract

We present the early-stage design and implementation of a multimodal, real-time communication analysis system intended as a foundational interaction layer for adaptive VR training. The system integrates five parallel processing streams: (1) verbal and prosodic speech analysis, (2) skeletal gesture recognition from multi-view RGB cameras, (3) multimodal affective analysis combining lower-face video with upper-face facial EMG, (4) EEG-based mental state decoding, and (5) physiological arousal estimation from skin conductance, heart activity, and proxemic behavior. All signals are synchronized via Lab Streaming Layer to enable temporally aligned, continuous assessments of users' conscious and unconscious communication cues. Building on concepts from social semiotics and symbolic interactionism, we introduce an interpretation layer that links low-level signal representations to interactional constructs such as escalation and de-escalation. This layer is informed by domain knowledge from police instructors and lay participants, grounding system responses in realistic conflict scenarios. We demonstrate the feasibility and limitations of automated cue extraction in an XR-based de-escalation training project for law enforcement, reporting preliminary results for gesture recognition, emotion recognition under HMD occlusion, verbal assessment, mental state decoding, and physiological arousal. Our findings highlight the value of multi-view sensing and multimodal fusion for overcoming occlusion and viewpoint challenges, while underscoring that fusion and feedback must be treated as design problems rather than purely technical ones. The work contributes design resources and empirical insights for shaping human-AI-powered XR training in complex interpersonal settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents the early-stage design and implementation of a multimodal real-time communication analysis system for adaptive XR de-escalation training. It integrates five parallel processing streams (verbal/prosodic, skeletal gesture from multi-view RGB, multimodal affective with lower-face video and upper-face EMG, EEG mental state decoding, and physiological arousal from skin conductance/heart/proxemics) synchronized via Lab Streaming Layer, introduces a domain-informed interpretation layer mapping low-level signals to interactional constructs such as escalation and de-escalation based on police instructor and lay participant knowledge, and reports preliminary feasibility results for individual component accuracies in an XR law enforcement training context while noting limitations of fusion and occlusion challenges.

Significance. If the interpretation layer can be empirically validated, the work could provide useful design resources and empirical insights for human-AI XR training systems in high-stakes interpersonal scenarios, particularly by demonstrating multi-view sensing to address occlusion and by framing multimodal fusion as a design problem informed by social semiotics and symbolic interactionism. The preliminary component results and explicit acknowledgment of domain grounding add practical value for future closed-loop adaptive systems.

major comments (2)
  1. [Interpretation layer and preliminary results description] The central claim that the system enables temporally aligned, continuous assessments of conscious and unconscious cues to drive adaptive XR responses rests on the interpretation layer. However, the manuscript describes this layer conceptually and reports only isolated preliminary accuracies for individual streams (gesture recognition, emotion under HMD occlusion, verbal assessment, EEG decoding, physiological arousal) with no quantitative results on cross-stream fusion performance, agreement with expert ratings of escalation/de-escalation constructs, or closed-loop adaptation efficacy. This leaves the mapping from low-level signals to actionable high-level constructs untested.
  2. [Domain knowledge grounding section] The weakest assumption—that domain knowledge from police instructors and lay participants can reliably ground the interpretation layer in realistic scenarios—is stated but not validated through any reported inter-rater agreement, scenario coverage metrics, or comparison to expert-labeled interactions. This is load-bearing for claims of realistic conflict scenario grounding.
minor comments (2)
  1. [Abstract] The abstract states that preliminary results are reported but does not include specific accuracy values, error bars, or a summary table; adding these in the results section would improve clarity and allow readers to assess component feasibility directly.
  2. [System architecture description] Notation for the five streams and LSL synchronization is introduced without a diagram or explicit timing alignment equation; a figure illustrating the parallel streams and synchronization would aid reproducibility.

Simulated Author's Rebuttal

2 responses · 2 unresolved

We thank the referee for the insightful and constructive comments on our early-stage prototype manuscript. We appreciate the recognition of the system's potential and the clear identification of areas needing stronger framing around its preliminary scope. We address each major comment below and indicate planned revisions to clarify limitations without overstating current results.

read point-by-point responses
  1. Referee: The central claim that the system enables temporally aligned, continuous assessments of conscious and unconscious cues to drive adaptive XR responses rests on the interpretation layer. However, the manuscript describes this layer conceptually and reports only isolated preliminary accuracies for individual streams (gesture recognition, emotion under HMD occlusion, verbal assessment, EEG decoding, physiological arousal) with no quantitative results on cross-stream fusion performance, agreement with expert ratings of escalation/de-escalation constructs, or closed-loop adaptation efficacy. This leaves the mapping from low-level signals to actionable high-level constructs untested.

    Authors: We agree that the interpretation layer is central to enabling adaptive responses and that the reported results are limited to isolated component accuracies, with no fusion performance metrics, expert agreement data, or closed-loop efficacy results included. This manuscript presents an early-stage design and implementation study focused on multimodal integration via Lab Streaming Layer and feasibility of individual streams in an XR law enforcement training context. The layer is introduced conceptually, informed by social semiotics, symbolic interactionism, and initial domain input. In revision, we will expand the description of the layer with more concrete examples of mappings to escalation/de-escalation constructs, explicitly frame the work as preliminary, and discuss the lack of fusion and closed-loop validation as a key limitation and future research direction. We cannot add new quantitative fusion or adaptation results, as these were not part of the current prototype evaluation. revision: partial

  2. Referee: The weakest assumption—that domain knowledge from police instructors and lay participants can reliably ground the interpretation layer in realistic scenarios—is stated but not validated through any reported inter-rater agreement, scenario coverage metrics, or comparison to expert-labeled interactions. This is load-bearing for claims of realistic conflict scenario grounding.

    Authors: We acknowledge that the domain knowledge grounding is a foundational assumption and that the manuscript does not report quantitative validation such as inter-rater agreement, scenario coverage, or comparisons to expert-labeled data. The consultations with police instructors and lay participants were used to shape the interactional constructs, but formal validation metrics were outside the scope of this initial implementation paper. In the revised manuscript, we will provide greater detail on the knowledge elicitation process and more explicitly note the absence of these validation measures as a limitation, while identifying it as an important area for subsequent work. We cannot supply the requested agreement or coverage statistics without new data collection. revision: partial

standing simulated objections not resolved
  • Quantitative results on cross-stream fusion performance, agreement with expert ratings of escalation/de-escalation constructs, or closed-loop adaptation efficacy
  • Inter-rater agreement, scenario coverage metrics, or comparisons to expert-labeled interactions for domain knowledge grounding of the interpretation layer

Circularity Check

0 steps flagged

No circularity: systems description without derivations or self-referential predictions

full rationale

The manuscript is an early-stage systems description of a multimodal XR training platform. It details five synchronized signal streams, Lab Streaming Layer integration, and a conceptually introduced interpretation layer grounded in external domain knowledge from police instructors and participants. No equations, fitted parameters, predictions, or first-principles derivations appear in the provided text or abstract. The central claims rest on feasibility demonstrations of individual components and design insights rather than any chain that reduces to its own inputs by construction. Self-citations are absent from the load-bearing elements, and the work does not rename known results or smuggle ansatzes. This is a standard non-circular engineering paper.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on standard multimodal processing pipelines plus one domain assumption about expert knowledge grounding the interpretation layer; no free parameters or new entities are introduced.

axioms (1)
  • domain assumption Domain knowledge from police instructors and lay participants accurately informs the mapping from low-level signals to interactional constructs such as escalation and de-escalation.
    Invoked in the description of the interpretation layer informed by domain knowledge.

pith-pipeline@v0.9.0 · 5593 in / 1161 out tokens · 34562 ms · 2026-05-10T15:38:27.668486+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

67 extracted references · 39 canonical work pages · 1 internal anchor

  1. [1] C. Jewitt (Ed.), The Routledge Handbook of Multimodal Analysis, 2nd ed., Routledge, London, 2014.
  2. [2] N. Otu, Decoding nonverbal communication in law enforcement, Salus Journal 3 (2023) 1–16. URL: https://journals.csu.domains/index.php/salusjournal/article/view/42
  3. [3] J. L. Lakin, Automatic cognitive processes and nonverbal communication, in: V. Manusov, M. L. Patterson (Eds.), The SAGE Handbook of Nonverbal Communication, SAGE Publications, Inc., 2006, pp. 59–78. doi:10.4135/9781412976152.n4
  4. [4] R. M. Rozelle, J. C. Baxter, The interpretation of nonverbal behavior in a role-defined interaction sequence: The police–citizen encounter, Journal of Nonverbal Behavior 2 (1978) 167–180. doi:10.1007/BF01145819
  5. [5] M. Schmid Mast, G. Cousin, The role of nonverbal communication in medical interactions: Empirical results, theoretical bases, and methodological issues, in: L. R. Martin, M. R. DiMatteo (Eds.), The Oxford Handbook of Health Communication, Behavior Change, and Treatment Adherence, Oxford Library of Psychology, Oxford University Press, Oxford, 2013. doi:1...
  6. [6] M. D. White, C. Orosco, S. Watts, Beyond force and injuries: Examining alternative (and important) outcomes for police de-escalation training, Journal of Criminal Justice 89 (2023) 102129. doi:10.1016/j.jcrimjus.2023.102129
  7. [7] D. Sjöberg, Simulation Exercises in Police Education, Why and How? A Teacher’s Perspective, International Journal for Research in Vocational Education and Training 11 (2024) 460–482. doi:10.13152/IJRVET.11.3.6
  8. [8] A. Gelis, S. Cervello, R. Rey, G. Llorca, P. Lambert, N. Franck, A. Dupeyron, M. Delpont, B. Rolland, Peer Role-Play for Training Communication Skills in Medical Students: A Systematic Review, Simulation in Healthcare: The Journal of the Society for Simulation in Healthcare 15 (2020) 106–111. doi:10.1097/SIH.0000000000000412
  9. [9] M. V. Sanchez-Vives, M. Slater, From presence to consciousness through virtual reality, Nature Reviews Neuroscience 6 (2005) 332–339.
  10. [10] R. B. H. Tootell, S. L. Zapetis, B. Babadi, Z. Nasiriavanaki, D. E. Hughes, K. Mueser, M. Otto, E. Pace-Schott, D. J. Holt, Psychological and physiological evidence for an initial ‘Rough Sketch’ calculation of personal space, Scientific Reports 11 (2021) 20960. doi:10.1038/s41598-021-99578-1
  11. [11] J. Marín-Morales, J. L. Higuera-Trujillo, J. Guixeres, C. Llinares, M. Alcañiz, G. Valenza, Heart rate variability analysis for the assessment of immersive emotional arousal using virtual reality: Comparing real and virtual scenarios, PLOS ONE 16 (2021) e0254098. doi:10.1371/journal.pone.0254098
  12. [12] R. Hodge, G. Kress, Social Semiotics, Polity Press, Oxford, 1998.
  13. [13] TARGET – Training Augmented Reality Generalised Environment Toolkit, https://researchportal.list.lu/projects/detail/target/, 2025. Project webpage. Accessed: 2025-12-19.
  14. [14] RE-liON, Virtual reality training platform, https://re-lion.com, 2025. Company website. Accessed: 2025-12-19.
  15. [15] Refense, Virtual reality training platform, https://www.refense.com, 2025. Company website. Accessed: 2025-12-19.
  16. [16] Street Smarts VR, Virtual Reality Training, https://www.streetsmartsvr.com/vrtraining, 2025. Company website. Accessed: 2025-12-19.
  17. [17] Axon Enterprise, Inc., Axon | Protect Life, https://www.axon.com, 2025. Company website. Accessed: 2025-12-19.
  18. [18] Operator XR, Operator XR, https://operatorxr.com/, 2025. Company website. Accessed: 2025-12-19.
  19. [19] HGXR, HOLOFORCE BLUE, https://hgxr.com/holoforce-blue/, 2025. Company webpage. Accessed: 2025-12-19.
  20. [20] Verlag Dashöfer GmbH, VR EasySpeech – Erläuterung der Parameter, Technical Report, Verlag Dashöfer GmbH, Hamburg, Germany, 2024. URL: https://static.dashoefer.de/download/vr/whitepaper_auswertungsparameter_20240717.pdf
  21. [21] M. Murtinger, J. C. Uhl, L. M. Atzmüller, G. Regal, M. Roither, Sound of the Police—Virtual Reality Training for Police Communication for High-Stress Operations, Multimodal Technologies and Interaction 8 (2024) 46. doi:10.3390/mti8060046
  22. [22] J. E. Muñoz, J. A. Lavoie, A. T. Pope, Psychophysiological insights and user perspectives: enhancing police de-escalation skills through full-body VR training, Frontiers in Psychology 15 (2024) 1390677.
  23. [23] T. Baur, I. Damian, P. Gebhard, K. Porayska-Pomsta, E. André, A job interview simulation: Social cue-based interaction with a virtual character, in: 2013 International Conference on Social Computing, 2013, pp. 220–227. doi:10.1109/SocialCom.2013.39
  24. [24] T. Baur, I. Damian, F. Lingenfelser, J. Wagner, E. André, NovA: Automated Analysis of Nonverbal Signals in Social Interactions, in: D. Hutchison, T. Kanade, J. Kittler, J. M. Kleinberg, F. Mattern, J. C. Mitchell, M. Naor, O. Nierstrasz, C. Pandu Rangan, B. Steffen, M. Sudan, D. Terzopoulos, D. Tygar, M. Y. Vardi, G. Weikum, A. A. Salah, H. Hung, O. Aran,...
  25. [25] P. Bartyzel, M. Igras-Cybulska, D. Hekiert, M. Majdak, G. Łukawski, T. Bohné, S. Tadeja, Exploring user reception of speech-controlled virtual reality environment for voice and public speaking training, Computers & Graphics 126 (2025) 104160. doi:10.1016/j.cag.2024.104160
  26. [26] J. Tauscher, A. Witt, S. Bosse, et al., Exploring neural and peripheral physiological correlates of simulator sickness, Computer Animation and Virtual Worlds 31 (2020) e1953.
  27. [27] Y. Chen, T. Stephani, M. T. Bagdasarian, A. Hilsmann, P. Eisert, A. Villringer, S. Bosse, M. Gaebler, V. V. Nikulin, Realness of face images can be decoded from non-linear modulation of EEG responses, Scientific Reports 14 (2024) 5683.
  28. [28] G. Chanel, C. Rebetez, M. Bétrancourt, T. Pun, Emotion assessment from physiological signals for adaptation of game difficulty, IEEE Trans. Syst. Man Cybern. A: Syst. Hum. 41 (2011) 1052–1063. doi:10.1109/TSMCA.2011.2116000
  29. [29] H. Gjoreski, I. I. Mavridou, M. Fatoorechi, I. Kiprijanovska, M. Gjoreski, G. Cox, C. Nduka, emteqPRO: Face-mounted mask for emotion recognition and affective computing, in: Adjunct Proceedings of the 2021 ACM International Joint Conference on Pervasive and Ubiquitous Computing and Proceedings of the 2021 ACM International Symposium on Wearable Computers,...
  30. [30] M. Gnacek, J. Broulidakis, I. Mavridou, M. Fatoorechi, E. Seiss, T. Kostoulas, E. Balaguer-Ballester, I. Kiprijanovska, C. Rosten, C. Nduka, emteqPRO—fully integrated biometric sensing array for non-invasive biomedical research in virtual reality, Frontiers in Virtual Reality 3 (2022) 781218. doi:10.3389/frvir.2022.781218
  31. [31] I. Kiprijanovska, B. Sazdov, M. Majstoroski, S. Stankoski, M. Gjoreski, C. Nduka, H. Gjoreski, Facial expression recognition using facial mask with EMG sensors, in: VR4Health@MUM, 2022, pp. 23–28.
  32. [32] M. Gjoreski, I. Kiprijanovska, S. Stankoski, I. Mavridou, M. J. Broulidakis, H. Gjoreski, C. Nduka, Facial EMG sensing for monitoring affect using a wearable device, Scientific Reports 12 (2022) 16876. doi:10.1038/s41598-022-21456-1
  33. [33] S. Hickson, N. Dufour, A. Sud, V. Kwatra, I. Essa, Eyemotion: Classifying facial expressions in VR using eye-tracking cameras, in: Proc. IEEE WACV, 2019, pp. 1626–1635. doi:10.1109/WACV.2019.00178
  34. [34] M. Murakami, K. Kikui, K. Suzuki, F. Nakamura, M. Fukuoka, K. Masai, Y. Sugiura, M. Sugimoto, AffectiveHMD: facial expression recognition in head mounted display using embedded photo reflective sensors, in: ACM SIGGRAPH 2019 Emerging Technologies, SIGGRAPH ’19, Association for Computing Machinery, New York, NY, USA, 2019. doi:10.1145/3305367.3335039
  35. [35] N. Numan, F. t. Haar, P. Cesar, Generative RGB-D face completion for head-mounted display removal, in: Proc. IEEE VRW, 2021, pp. 109–116. doi:10.1109/VRW52623.2021.00028
  36. [36] K. Tomotaki-Dawoud, B. Nierula, F. T. Siewe, T. Koch, D. J. Meyer, A. Bock, M. Heinze, D. Knuth, D. Martin, J. Schander, A. Hilsmann, P. Eisert, S. Bosse, Multi-view gesture recognition in conflict situations, in: 2024 International Symposium on Multimedia (ISM), 2024, pp. 267–268. doi:10.1109/ISM63611.2024.00060
  37. [37] P. L. Indrasiri, B. Kashyap, C. Kolambahewage, B. Nakisa, K. Ijaz, P. N. Pathirana, VR based emotion recognition using deep multimodal fusion with biosignals across multiple anatomical domains (2024). URL: https://arxiv.org/abs/2412.02283. arXiv:2412.02283
  38. [38] E. M. Polo, F. Iacomi, A. V. Rey, D. Ferraris, A. Paglialonga, R. Barbieri, Advancing emotion recognition with virtual reality: A multimodal approach using physiological signals and machine learning, Computers in Biology and Medicine 193 (2025) 110310. doi:10.1016/j.compbiomed.2025.110310
  39. [39] B. Nierula, K. Tomotaki-Dawoud, M. Akguel, M. T. Lafci, D. Przewozny, A. Hilsmann, P. Eisert, S. Bosse, Occlusion-robust multimodal emotion recognition in VR via fusion of facial images and EMG, 2026. Accepted at ACM IUI 2026 Workshop SHAPEXR.
  40. [40] C. Kothe, S. Y. Shirazi, T. Stenner, D. Medine, C. Boulay, M. I. Grivich, F. Artoni, T. Mullen, A. Delorme, S. Makeig, The lab streaming layer for synchronized multimodal recording, Imaging Neuroscience 3 (2025) IMAG.a.136. URL: https://doi.org/10.1162/IMAG.a.136. doi:10.1162/IMAG.a.136. Open Access.
  41. [41] LSL Developers, Lab Streaming Layer (LSL), https://github.com/sccn/labstreaminglayer, 2025. GitHub repository. Accessed: 2025-12-19.
  42. [42] SYSTRAN SA, Faster Whisper transcription with CTranslate2, https://github.com/SYSTRAN/faster-whisper, 2025. GitHub repository. Accessed: 2025-12-19.
  43. [43] A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, I. Sutskever, Robust Speech Recognition via Large-Scale Weak Supervision, 2022. doi:10.48550/arXiv.2212.04356
  44. [44] D. Wiśniewski, Z. Rostek, A. Nowakowski, FAME-MT dataset: Formality awareness made easy for machine translation purposes, arXiv preprint arXiv:2405.11942 (2024).
  45. [45] M. Nadejde, A. Currey, B. Hsu, X. Niu, M. Federico, G. Dinu, CoCoA-MT: A dataset and benchmark for contrastive controlled MT with application to formality, in: Findings of the Association for Computational Linguistics: NAACL 2022, pp. 616–632 (2022).
  46. [46] H. Asghari, F. Hewett, HIIG at GermEval 2022: Best of both worlds ensemble for automatic text complexity assessment, in: Proceedings of the GermEval 2022 Workshop on Text Complexity Assessment of German Text, pp. 15–20 (2022).
  47. [47] B. Naderi, S. Mohtaj, K. Ensikat, S. Möller, Subjective assessment of text complexity: A dataset for German language, arXiv preprint arXiv:1904.07733 (2019).
  48. [48] D. Wu, T. D. Parsons, S. S. Narayanan, Acoustic feature analysis in speech emotion primitives estimation, in: Interspeech 2010, 2010, pp. 785–788. doi:10.21437/Interspeech.2010-285
  49. [49] ISO 532-1:2017, Acoustics - Methods for calculating loudness. Part 1: Zwicker method, Technical Report, International Organization for Standardization, 2017.
  50. [50] G. F. Coop, MOSQITO, 2025. doi:10.5281/zenodo.10629475
  51. [51] M. Mauch, S. Dixon, pYIN: A fundamental frequency estimator using probabilistic threshold distributions, in: 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014, pp. 659–663. doi:10.1109/ICASSP.2014.6853678
  52. [52] J. Wagner, A. Triantafyllopoulos, H. Wierstorf, M. Schmitt, F. Burkhardt, F. Eyben, B. W. Schuller, Dawn of the transformer era in speech emotion recognition: Closing the valence gap, IEEE Transactions on Pattern Analysis and Machine Intelligence 45 (2023) 10745–10759. doi:10.1109/tpami.2023.3263585
  53. [53] V. Bazarevsky, I. Grishchenko, K. Raveendran, T. L. Zhu, F. Zhang, M. Grundmann, BlazePose: On-device real-time body pose tracking, arXiv abs/2006.10204 (2020).
  54. [54] I. Grishchenko, V. Bazarevsky, A. Zanfir, E. G. Bazavan, M. Zanfir, R. Yee, K. Raveendran, M. Zhdanovich, M. Grundmann, C. Sminchisescu, BlazePose GHUM Holistic: Real-time 3D human landmarks and pose estimation, 2022. arXiv:2206.11678
  55. [55] H. Xu, E. G. Bazavan, A. Zanfir, W. T. Freeman, R. Sukthankar, C. Sminchisescu, GHUM & GHUML: Generative 3D human shape and articulated pose models, in: IEEE-CVPR, 2020, pp. 6183–6192. doi:10.1109/CVPR42600.2020.00622
  56. [56] P. Ekman, An argument for basic emotions, Cognition and Emotion 6 (1992) 169–200. Publisher: Routledge.
  57. [57] V. V. Nikulin, G. Nolte, G. Curio, A novel method for reliable and fast extraction of neuronal EEG/MEG oscillations on the basis of spatio-spectral decomposition, NeuroImage 55 (2011) 1528–1535. doi:10.1016/j.neuroimage.2011.01.057
  58. [58] S. Dähne, F. C. Meinecke, S. Haufe, J. Höhne, M. Tangermann, K.-R. Müller, V. V. Nikulin, SPoC: A novel framework for relating the amplitude of neuronal oscillations to behaviorally relevant parameters, NeuroImage 86 (2014) 111–122. doi:10.1016/j.neuroimage.2013.07.079
  59. [59] S. M. Hofmann, F. Klotzsche, A. Mariola, V. Nikulin, A. Villringer, M. Gaebler, Decoding subjective emotional arousal from EEG during an immersive virtual reality experience, eLife 10 (2021) e64812. doi:10.7554/eLife.64812
  60. [60] B. Nierula, M. T. Lafci, A. Melnik, M. Akgül, F. T. Siewe, S. Bosse, Differential Physiological Responses to Proxemic and Facial Threats in Virtual Avatar Interactions, 2025. doi:10.48550/ARXIV.2508.10586
  61. [61] R. Flewitt, S. Price, T. Korkiakangas, Multimodality: Methodological explorations, Qualitative Research 19 (2018) 3–6. doi:10.1177/1468794118817414
  62. [62] H. Blumer, Symbolic interactionism: Perspective and method, Prentice Hall, Englewood Cliffs, NJ, 1969.
  63. [63] L. D. Keesman, D. Weenink, Feel it coming: Situational turning points in police-civilian encounters, Historical Social Research 47 (2022) 88–110. doi:10.12759/hsr.47.2022.04
  64. [64] H. M. Sunde, How does it end well? An interview study of police officers’ perceptions of de-escalation, Nordic Journal of Studies in Policing 11 (2024) 1–21. URL: https://doi.org/10.18261/njsp.11.1.1. doi:10.18261/njsp.11.1.1
  65. [65] J. Hu, L. Mathur, P. P. Liang, L.-P. Morency, OpenFace 3.0: A lightweight multitask system for comprehensive facial behavior analysis, arXiv preprint arXiv:2506.02891 (2025).
  66. [66] B. Nierula, M. T. Lafci, A. Melnik, E. Gaudinot, N. Karuzin, S. Bosse, Personal space in virtual reality interactions assessed with electroencephalography and skin conductance, in: Neuroscience 2025, San Diego, California, United States, 2025.
  67. [67] M. T. Lafci, B. Nierula, D. Damar, S. Bosse, Too Close for Comfort? Investigating Virtual Professor Distance and Student Learning in VR, in: IEEE SMC, Vienna, 2025.