pith. machine review for the scientific record.

arxiv: 2604.11570 · v1 · submitted 2026-04-13 · 💻 cs.HC · cs.MM

Recognition: unknown

From Multimodal Signals to Adaptive XR Experiences for De-escalation Training

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 15:38 UTC · model grok-4.3

classification 💻 cs.HC cs.MM
keywords multimodal signals · XR training · de-escalation · adaptive systems · law enforcement training · signal fusion · EEG · physiological monitoring

The pith

Five synchronized multimodal streams enable real-time adaptation in XR de-escalation training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper describes an early-stage multimodal system designed as a foundation for adaptive XR training in de-escalation scenarios for law enforcement. It integrates five parallel processing streams—verbal and prosodic speech, skeletal gestures from multi-view cameras, affective analysis combining video and EMG, EEG mental state decoding, and physiological arousal from skin conductance and heart data—all synchronized via Lab Streaming Layer for continuous, temporally aligned assessments of conscious and unconscious cues. An interpretation layer then connects these low-level signals to constructs like escalation and de-escalation, drawing on domain knowledge from police instructors and lay participants to ground the system in realistic conflict situations. The work reports preliminary results on cue extraction feasibility while emphasizing that effective fusion and feedback require design choices rather than purely technical solutions.
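As a concrete illustration of the synchronization mechanism, the sketch below shows how five Lab Streaming Layer streams could be read onto a common clock with pylsl. The stream names and the `handle` consumer are hypothetical placeholders, not identifiers from the paper; this is a minimal reading-side sketch, not the authors' implementation.

```python
# Minimal sketch (not the paper's code): reading five hypothetical LSL streams
# and mapping their timestamps onto one local clock for temporal alignment.
from pylsl import StreamInlet, resolve_byprop

STREAM_NAMES = ["Speech", "Gesture", "FaceEMG", "EEG", "Physio"]  # hypothetical names

def handle(name, timestamp, sample):
    # Placeholder consumer; downstream feature extraction would go here.
    print(f"{name} @ {timestamp:.4f}: {sample[:3]}")

inlets = {}
for name in STREAM_NAMES:
    found = resolve_byprop("name", name, timeout=5.0)  # discover the stream on the network
    if found:
        inlets[name] = StreamInlet(found[0])

for _ in range(1000):  # bounded polling loop for the sketch
    for name, inlet in inlets.items():
        sample, ts = inlet.pull_sample(timeout=0.0)  # non-blocking pull
        if sample is None:
            continue
        # time_correction() estimates the offset between the sender's clock and
        # the local LSL clock, so samples from different devices share a time base.
        handle(name, ts + inlet.time_correction(), sample)
```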

Core claim

The authors present the design and implementation of a system integrating five parallel processing streams—verbal and prosodic speech analysis, skeletal gesture recognition from multi-view RGB cameras, multimodal affective analysis combining lower-face video with upper-face facial EMG, EEG-based mental state decoding, and physiological arousal estimation from skin conductance, heart activity, and proxemic behavior—synchronized via Lab Streaming Layer to enable temporally aligned, continuous assessments of users' conscious and unconscious communication cues, with an interpretation layer informed by domain knowledge from police instructors and lay participants to link signals to interactional constructs such as escalation and de-escalation.

What carries the argument

The central mechanism is the Lab Streaming Layer synchronization of five multimodal streams combined with an interpretation layer that maps low-level signal representations to constructs such as escalation and de-escalation using domain knowledge.
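For intuition only, here is one way such an interpretation layer could be sketched: a weighted aggregation of normalized per-stream features into an escalation score that then drives scenario adaptation. The feature names, weights, and thresholds are invented placeholders standing in for the instructor-derived domain knowledge; the paper does not specify this mapping.

```python
# Hypothetical sketch of an interpretation layer: per-stream features -> escalation score.
# Weights and thresholds are placeholders for instructor-elicited domain knowledge.
from dataclasses import dataclass

@dataclass
class StreamFeatures:
    speech_loudness: float    # 0..1, verbal/prosodic stream
    gesture_agitation: float  # 0..1, multi-view skeletal stream
    negative_affect: float    # 0..1, video + facial-EMG fusion
    eeg_arousal: float        # 0..1, EEG mental-state decoding
    scr_arousal: float        # 0..1, skin conductance / heart activity

WEIGHTS = {  # hypothetical, not the paper's values
    "speech_loudness": 0.25,
    "gesture_agitation": 0.25,
    "negative_affect": 0.20,
    "eeg_arousal": 0.15,
    "scr_arousal": 0.15,
}

def escalation_score(f: StreamFeatures) -> float:
    """Weighted aggregate in [0, 1]; higher reads as a more escalated situation."""
    return sum(w * getattr(f, k) for k, w in WEIGHTS.items())

def adapt_scenario(score: float) -> str:
    # Placeholder decision rule for closing the loop back into the XR scenario.
    if score > 0.7:
        return "avatar de-escalates; trainer feedback cue triggered"
    if score > 0.4:
        return "avatar maintains current behaviour"
    return "avatar raises scenario intensity"

print(adapt_scenario(escalation_score(StreamFeatures(0.8, 0.6, 0.7, 0.5, 0.6))))
```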

If this is right

  • Multi-view sensing and multimodal fusion overcome occlusion and viewpoint challenges from head-mounted displays in gesture and emotion recognition.
  • Automated cue extraction works for verbal assessment, mental state decoding, and physiological arousal in XR training settings.
  • Fusion and feedback must be handled as design problems in human-AI XR training rather than purely technical tasks.
  • The approach supplies design resources and empirical insights for adaptive training in complex interpersonal conflict scenarios.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Real-time responses to unconscious physiological and EEG cues could let trainees practice handling stress signals that a human instructor might overlook.
  • Extending the system to new training domains would require fresh domain-specific knowledge to keep the interpretation layer accurate.
  • Mismatches between the interpretation layer and actual field outcomes could arise if the grounding from instructors does not generalize beyond the tested scenarios.

Load-bearing premise

Domain knowledge from police instructors and lay participants can reliably ground the interpretation layer that links low-level signals to escalation and de-escalation constructs in realistic scenarios.

What would settle it

A side-by-side comparison in which the system's real-time cue assessments and adaptive responses in XR scenarios are scored against independent expert ratings by police instructors for alignment on escalation levels.
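If such a comparison were run, the scoring could be as simple as agreement statistics between per-window system outputs and instructor labels; the sketch below, with invented placeholder ratings, shows one plausible way to compute it.

```python
# Sketch of the proposed validation: agreement between system escalation levels and
# independent instructor ratings on the same scenario windows. Ratings are invented.
from sklearn.metrics import cohen_kappa_score
from scipy.stats import spearmanr

system_levels     = [0, 1, 1, 2, 3, 2, 1, 0, 2, 3]  # per-window system output (ordinal 0-3)
instructor_levels = [0, 1, 2, 2, 3, 2, 1, 1, 2, 3]  # expert consensus labels

# Quadratic weighting credits near-misses on the ordinal escalation scale.
kappa = cohen_kappa_score(system_levels, instructor_levels, weights="quadratic")
rho, p = spearmanr(system_levels, instructor_levels)
print(f"weighted kappa = {kappa:.2f}, Spearman rho = {rho:.2f} (p = {p:.3f})")
```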

Figures

Figures reproduced from arXiv: 2604.11570 by Birgit Nierula, Daniel Johannes Meyer, Iryna Ignatieva, Karam Tomotaki-Dawoud, Mina Mottahedin, Sebastian Bosse, Thomas Koch.

Figure 1: Conceptual system overview of the extraction layer integrated into an adaptive XR experience. The system objectively assesses user experiences during virtual conflicts by extracting features from signal modalities and fusing them with avatar behaviour and storybook context. The resulting signal adapts the virtual experience. We propose an integrated pipeline that extracts communication cues from multimoda… view at source ↗
Figure 2: Integrated dual-stream pipeline combining gesture and emotion recognition for VR training feedback. The emotion stream (top) fuses lower-face video with upper-face EMG under HMD occlusion; the gesture stream (bottom) processes multi-view skeletal data to detect conflict-relevant body language. Both streams converge into a unified weighted output that delivers real-time, multimodal feedback for skill refine… view at source ↗
Figure 3: Mean values of length and loudness over answers of N=20 participants with an indicated linear regression. Our investigations into automated recognition of communication cues for conflict-related training scenarios yielded promising results across six complementary domains: For verbal communication, pilot test data was collected from N=20 police officers interacting in a prototypical VR scene for de-… view at source ↗
Figure 4: Activation patterns in 3 participants related to emotional arousal, adapted from [66]. For bodily arousal, physiological markers of emotional arousal were tested in an experimental scenario where the avatar did or did not respect the user’s personal space and showed a neutral or angry emotional facial expression [60]. Multimodal assessment revealed that our analysis methods were sufficiently sensitive to de… view at source ↗
Figure 5: Physiological results adapted from [60]. (A) Discomfort ratings to an avatar that respected the user’s personal space (light green) or violated it (dark green) with neutral or angry facial expression. (B) Effect on SCR amplitudes. There was only a main effect of Personal Space on SCR amplitudes. (C) Interaction effect of the avatar phase (whether approaching or standing in front of the user) on SCR amplitu… view at source ↗
read the original abstract

We present the early-stage design and implementation of a multimodal, real-time communication analysis system intended as a foundational interaction layer for adaptive VR training. The system integrates five parallel processing streams: (1) verbal and prosodic speech analysis, (2) skeletal gesture recognition from multi-view RGB cameras, (3) multimodal affective analysis combining lower-face video with upper-face facial EMG, (4) EEG-based mental state decoding, and (5) physiological arousal estimation from skin conductance, heart activity, and proxemic behavior. All signals are synchronized via Lab Streaming Layer to enable temporally aligned, continuous assessments of users' conscious and unconscious communication cues. Building on concepts from social semiotics and symbolic interactionism, we introduce an interpretation layer that links low-level signal representations to interactional constructs such as escalation and de-escalation. This layer is informed by domain knowledge from police instructors and lay participants, grounding system responses in realistic conflict scenarios. We demonstrate the feasibility and limitations of automated cue extraction in an XR-based de-escalation training project for law enforcement, reporting preliminary results for gesture recognition, emotion recognition under HMD occlusion, verbal assessment, mental state decoding, and physiological arousal. Our findings highlight the value of multi-view sensing and multimodal fusion for overcoming occlusion and viewpoint challenges, while underscoring that fusion and feedback must be treated as design problems rather than purely technical ones. The work contributes design resources and empirical insights for shaping human-AI-powered XR training in complex interpersonal settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents the early-stage design and implementation of a multimodal real-time communication analysis system for adaptive XR de-escalation training. It integrates five parallel processing streams (verbal/prosodic, skeletal gesture from multi-view RGB, multimodal affective with lower-face video and upper-face EMG, EEG mental state decoding, and physiological arousal from skin conductance/heart/proxemics) synchronized via Lab Streaming Layer, introduces a domain-informed interpretation layer mapping low-level signals to interactional constructs such as escalation and de-escalation based on police instructor and lay participant knowledge, and reports preliminary feasibility results for individual component accuracies in an XR law enforcement training context while noting limitations of fusion and occlusion challenges.

Significance. If the interpretation layer can be empirically validated, the work could provide useful design resources and empirical insights for human-AI XR training systems in high-stakes interpersonal scenarios, particularly by demonstrating multi-view sensing to address occlusion and by framing multimodal fusion as a design problem informed by social semiotics and symbolic interactionism. The preliminary component results and explicit acknowledgment of domain grounding add practical value for future closed-loop adaptive systems.

major comments (2)
  1. [Interpretation layer and preliminary results description] The central claim that the system enables temporally aligned, continuous assessments of conscious and unconscious cues to drive adaptive XR responses rests on the interpretation layer. However, the manuscript describes this layer conceptually and reports only isolated preliminary accuracies for individual streams (gesture recognition, emotion under HMD occlusion, verbal assessment, EEG decoding, physiological arousal) with no quantitative results on cross-stream fusion performance, agreement with expert ratings of escalation/de-escalation constructs, or closed-loop adaptation efficacy. This leaves the mapping from low-level signals to actionable high-level constructs untested.
  2. [Domain knowledge grounding section] The weakest assumption—that domain knowledge from police instructors and lay participants can reliably ground the interpretation layer in realistic scenarios—is stated but not validated through any reported inter-rater agreement, scenario coverage metrics, or comparison to expert-labeled interactions. This is load-bearing for claims of realistic conflict scenario grounding.
minor comments (2)
  1. [Abstract] The abstract states that preliminary results are reported but does not include specific accuracy values, error bars, or a summary table; adding these in the results section would improve clarity and allow readers to assess component feasibility directly.
  2. [System architecture description] Notation for the five streams and LSL synchronization is introduced without a diagram or explicit timing alignment equation; a figure illustrating the parallel streams and synchronization would aid reproducibility.

Simulated Author's Rebuttal

2 responses · 2 unresolved

We thank the referee for the insightful and constructive comments on our early-stage prototype manuscript. We appreciate the recognition of the system's potential and the clear identification of areas needing stronger framing around its preliminary scope. We address each major comment below and indicate planned revisions to clarify limitations without overstating current results.

read point-by-point responses
  1. Referee: The central claim that the system enables temporally aligned, continuous assessments of conscious and unconscious cues to drive adaptive XR responses rests on the interpretation layer. However, the manuscript describes this layer conceptually and reports only isolated preliminary accuracies for individual streams (gesture recognition, emotion under HMD occlusion, verbal assessment, EEG decoding, physiological arousal) with no quantitative results on cross-stream fusion performance, agreement with expert ratings of escalation/de-escalation constructs, or closed-loop adaptation efficacy. This leaves the mapping from low-level signals to actionable high-level constructs untested.

    Authors: We agree that the interpretation layer is central to enabling adaptive responses and that the reported results are limited to isolated component accuracies, with no fusion performance metrics, expert agreement data, or closed-loop efficacy results included. This manuscript presents an early-stage design and implementation study focused on multimodal integration via Lab Streaming Layer and feasibility of individual streams in an XR law enforcement training context. The layer is introduced conceptually, informed by social semiotics, symbolic interactionism, and initial domain input. In revision, we will expand the description of the layer with more concrete examples of mappings to escalation/de-escalation constructs, explicitly frame the work as preliminary, and discuss the lack of fusion and closed-loop validation as a key limitation and future research direction. We cannot add new quantitative fusion or adaptation results, as these were not part of the current prototype evaluation. revision: partial

  2. Referee: The weakest assumption—that domain knowledge from police instructors and lay participants can reliably ground the interpretation layer in realistic scenarios—is stated but not validated through any reported inter-rater agreement, scenario coverage metrics, or comparison to expert-labeled interactions. This is load-bearing for claims of realistic conflict scenario grounding.

    Authors: We acknowledge that the domain knowledge grounding is a foundational assumption and that the manuscript does not report quantitative validation such as inter-rater agreement, scenario coverage, or comparisons to expert-labeled data. The consultations with police instructors and lay participants were used to shape the interactional constructs, but formal validation metrics were outside the scope of this initial implementation paper. In the revised manuscript, we will provide greater detail on the knowledge elicitation process and more explicitly note the absence of these validation measures as a limitation, while identifying it as an important area for subsequent work. We cannot supply the requested agreement or coverage statistics without new data collection. revision: partial

standing simulated objections not resolved
  • Quantitative results on cross-stream fusion performance, agreement with expert ratings of escalation/de-escalation constructs, or closed-loop adaptation efficacy
  • Inter-rater agreement, scenario coverage metrics, or comparisons to expert-labeled interactions for domain knowledge grounding of the interpretation layer

Circularity Check

0 steps flagged

No circularity: systems description without derivations or self-referential predictions

full rationale

The manuscript is an early-stage systems description of a multimodal XR training platform. It details five synchronized signal streams, Lab Streaming Layer integration, and a conceptually introduced interpretation layer grounded in external domain knowledge from police instructors and participants. No equations, fitted parameters, predictions, or first-principles derivations appear in the provided text or abstract. The central claims rest on feasibility demonstrations of individual components and design insights rather than any chain that reduces to its own inputs by construction. Self-citations are absent from the load-bearing elements, and the work does not rename known results or smuggle ansatzes. This is a standard non-circular engineering paper.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on standard multimodal processing pipelines plus one domain assumption about expert knowledge grounding the interpretation layer; no free parameters or new entities are introduced.

axioms (1)
  • domain assumption Domain knowledge from police instructors and lay participants accurately informs the mapping from low-level signals to interactional constructs such as escalation and de-escalation.
    Invoked in the description of the interpretation layer informed by domain knowledge.

pith-pipeline@v0.9.0 · 5593 in / 1161 out tokens · 34562 ms · 2026-05-10T15:38:27.668486+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

67 extracted references · 39 canonical work pages · 1 internal anchor

  1. [1] C. Jewitt (Ed.), The Routledge Handbook of Multimodal Analysis, 2nd ed., Routledge, London, 2014.
  2. [2] N. Otu, Decoding nonverbal communication in law enforcement, Salus Journal 3 (2023) 1–16. URL: https://journals.csu.domains/index.php/salusjournal/article/view/42
  3. [3] J. L. Lakin, Automatic cognitive processes and nonverbal communication, in: V. Manusov, M. L. Patterson (Eds.), The SAGE Handbook of Nonverbal Communication, SAGE Publications, Inc., 2006, pp. 59–78. doi:10.4135/9781412976152.n4
  4. [4] R. M. Rozelle, J. C. Baxter, The interpretation of nonverbal behavior in a role-defined interaction sequence: The police–citizen encounter, Journal of Nonverbal Behavior 2 (1978) 167–180. doi:10.1007/BF01145819
  5. [5] M. Schmid Mast, G. Cousin, The role of nonverbal communication in medical interactions: Empirical results, theoretical bases, and methodological issues, in: L. R. Martin, M. R. DiMatteo (Eds.), The Oxford Handbook of Health Communication, Behavior Change, and Treatment Adherence, Oxford Library of Psychology, Oxford University Press, Oxford, 2013. doi:1...
  6. [6] M. D. White, C. Orosco, S. Watts, Beyond force and injuries: Examining alternative (and important) outcomes for police de-escalation training, Journal of Criminal Justice 89 (2023) 102129. doi:10.1016/j.jcrimjus.2023.102129
  7. [7] D. Sjöberg, Simulation Exercises in Police Education, Why and How? A Teacher’s Perspective, International Journal for Research in Vocational Education and Training 11 (2024) 460–482. doi:10.13152/IJRVET.11.3.6
  8. [8] A. Gelis, S. Cervello, R. Rey, G. Llorca, P. Lambert, N. Franck, A. Dupeyron, M. Delpont, B. Rolland, Peer Role-Play for Training Communication Skills in Medical Students: A Systematic Review, Simulation in Healthcare: The Journal of the Society for Simulation in Healthcare 15 (2020) 106–111. doi:10.1097/SIH.0000000000000412
  9. [9] M. V. Sanchez-Vives, M. Slater, From presence to consciousness through virtual reality, Nature Reviews Neuroscience 6 (2005) 332–339.
  10. [10] R. B. H. Tootell, S. L. Zapetis, B. Babadi, Z. Nasiriavanaki, D. E. Hughes, K. Mueser, M. Otto, E. Pace-Schott, D. J. Holt, Psychological and physiological evidence for an initial ‘Rough Sketch’ calculation of personal space, Scientific Reports 11 (2021) 20960. doi:10.1038/s41598-021-99578-1
  11. [11] J. Marín-Morales, J. L. Higuera-Trujillo, J. Guixeres, C. Llinares, M. Alcañiz, G. Valenza, Heart rate variability analysis for the assessment of immersive emotional arousal using virtual reality: Comparing real and virtual scenarios, PLOS ONE 16 (2021) e0254098. doi:10.1371/journal.pone.0254098
  12. [12] R. Hodge, G. Kress, Social Semiotics, Polity Press, Oxford, 1998.
  13. [13] TARGET – Training Augmented Reality Generalised Environment Toolkit, https://researchportal.list.lu/projects/detail/target/, 2025. Project webpage. Accessed: 2025-12-19.
  14. [14] RE-liON, Virtual reality training platform, https://re-lion.com, 2025. Company website. Accessed: 2025-12-19.
  15. [15] Refense, Virtual reality training platform, https://www.refense.com, 2025. Company website. Accessed: 2025-12-19.
  16. [16] Street Smarts VR, Virtual Reality Training, https://www.streetsmartsvr.com/vrtraining, 2025. Company website. Accessed: 2025-12-19.
  17. [17] Axon Enterprise, Inc., Axon | Protect Life, https://www.axon.com, 2025. Company website. Accessed: 2025-12-19.
  18. [18] Operator XR, Operator XR, https://operatorxr.com/, 2025. Company website. Accessed: 2025-12-19.
  19. [19] HGXR, HOLOFORCE BLUE, https://hgxr.com/holoforce-blue/, 2025. Company webpage. Accessed: 2025-12-19.
  20. [20] Verlag Dashöfer GmbH, VR EasySpeech – Erläuterung der Parameter, Technical Report, Verlag Dashöfer GmbH, Hamburg, Germany, 2024. URL: https://static.dashoefer.de/download/vr/whitepaper_auswertungsparameter_20240717.pdf
  21. [21] M. Murtinger, J. C. Uhl, L. M. Atzmüller, G. Regal, M. Roither, Sound of the Police—Virtual Reality Training for Police Communication for High-Stress Operations, Multimodal Technologies and Interaction 8 (2024) 46. doi:10.3390/mti8060046
  22. [22] J. E. Muñoz, J. A. Lavoie, A. T. Pope, Psychophysiological insights and user perspectives: enhancing police de-escalation skills through full-body VR training, Frontiers in Psychology 15 (2024) 1390677.
  23. [23] T. Baur, I. Damian, P. Gebhard, K. Porayska-Pomsta, E. André, A job interview simulation: Social cue-based interaction with a virtual character, in: 2013 International Conference on Social Computing, 2013, pp. 220–227. doi:10.1109/SocialCom.2013.39
  24. [24] T. Baur, I. Damian, F. Lingenfelser, J. Wagner, E. André, NovA: Automated Analysis of Nonverbal Signals in Social Interactions, in: D. Hutchison, T. Kanade, J. Kittler, J. M. Kleinberg, F. Mattern, J. C. Mitchell, M. Naor, O. Nierstrasz, C. Pandu Rangan, B. Steffen, M. Sudan, D. Terzopoulos, D. Tygar, M. Y. Vardi, G. Weikum, A. A. Salah, H. Hung, O. Aran,...
  25. [25] P. Bartyzel, M. Igras-Cybulska, D. Hekiert, M. Majdak, G. Łukawski, T. Bohné, S. Tadeja, Exploring user reception of speech-controlled virtual reality environment for voice and public speaking training, Computers & Graphics 126 (2025) 104160. doi:10.1016/j.cag.2024.104160
  26. [26] J. Tauscher, A. Witt, S. Bosse, et al., Exploring neural and peripheral physiological correlates of simulator sickness, Computer Animation and Virtual Worlds 31 (2020) e1953.
  27. [27] Y. Chen, T. Stephani, M. T. Bagdasarian, A. Hilsmann, P. Eisert, A. Villringer, S. Bosse, M. Gaebler, V. V. Nikulin, Realness of face images can be decoded from non-linear modulation of EEG responses, Scientific Reports 14 (2024) 5683.
  28. [28] G. Chanel, C. Rebetez, M. Bétrancourt, T. Pun, Emotion assessment from physiological signals for adaptation of game difficulty, IEEE Trans. Syst. Man Cybern. A: Syst. Hum. 41 (2011) 1052–1063. doi:10.1109/TSMCA.2011.2116000
  29. [29] H. Gjoreski, I. I. Mavridou, M. Fatoorechi, I. Kiprijanovska, M. Gjoreski, G. Cox, C. Nduka, emteqPRO: Face-mounted mask for emotion recognition and affective computing, in: Adjunct Proceedings of the 2021 ACM International Joint Conference on Pervasive and Ubiquitous Computing and Proceedings of the 2021 ACM International Symposium on Wearable Computers,...
  30. [30] M. Gnacek, J. Broulidakis, I. Mavridou, M. Fatoorechi, E. Seiss, T. Kostoulas, E. Balaguer-Ballester, I. Kiprijanovska, C. Rosten, C. Nduka, emteqPRO—fully integrated biometric sensing array for non-invasive biomedical research in virtual reality, Frontiers in Virtual Reality 3 (2022) 781218. doi:10.3389/frvir.2022.781218
  31. [31] I. Kiprijanovska, B. Sazdov, M. Majstoroski, S. Stankoski, M. Gjoreski, C. Nduka, H. Gjoreski, Facial expression recognition using facial mask with EMG sensors, in: VR4Health@MUM, 2022, pp. 23–28.
  32. [32] M. Gjoreski, I. Kiprijanovska, S. Stankoski, I. Mavridou, M. J. Broulidakis, H. Gjoreski, C. Nduka, Facial EMG sensing for monitoring affect using a wearable device, Scientific Reports 12 (2022) 16876. doi:10.1038/s41598-022-21456-1
  33. [33] S. Hickson, N. Dufour, A. Sud, V. Kwatra, I. Essa, Eyemotion: Classifying facial expressions in VR using eye-tracking cameras, in: Proc. IEEE WACV, 2019, pp. 1626–1635. doi:10.1109/WACV.2019.00178
  34. [34] M. Murakami, K. Kikui, K. Suzuki, F. Nakamura, M. Fukuoka, K. Masai, Y. Sugiura, M. Sugimoto, AffectiveHMD: facial expression recognition in head mounted display using embedded photo reflective sensors, in: ACM SIGGRAPH 2019 Emerging Technologies, SIGGRAPH ’19, Association for Computing Machinery, New York, NY, USA, 2019. doi:10.1145/3305367.3335039
  35. [35] N. Numan, F. t. Haar, P. Cesar, Generative RGB-D face completion for head-mounted display removal, in: Proc. IEEE VRW, 2021, pp. 109–116. doi:10.1109/VRW52623.2021.00028
  36. [36] K. Tomotaki-Dawoud, B. Nierula, F. T. Siewe, T. Koch, D. J. Meyer, A. Bock, M. Heinze, D. Knuth, D. Martin, J. Schander, A. Hilsmann, P. Eisert, S. Bosse, Multi-view gesture recognition in conflict situations, in: 2024 International Symposium on Multimedia (ISM), 2024, pp. 267–268. doi:10.1109/ISM63611.2024.00060
  37. [37] P. L. Indrasiri, B. Kashyap, C. Kolambahewage, B. Nakisa, K. Ijaz, P. N. Pathirana, VR based emotion recognition using deep multimodal fusion with biosignals across multiple anatomical domains (2024). URL: https://arxiv.org/abs/2412.02283. arXiv:2412.02283
  38. [38] E. M. Polo, F. Iacomi, A. V. Rey, D. Ferraris, A. Paglialonga, R. Barbieri, Advancing emotion recognition with virtual reality: A multimodal approach using physiological signals and machine learning, Computers in Biology and Medicine 193 (2025) 110310. doi:10.1016/j.compbiomed.2025.110310
  39. [39] B. Nierula, K. Tomotaki-Dawoud, M. Akguel, M. T. Lafci, D. Przewozny, A. Hilsmann, P. Eisert, S. Bosse, Occlusion-robust multimodal emotion recognition in VR via fusion of facial images and EMG, 2026. Accepted at ACM IUI 2026 Workshop SHAPEXR.
  40. [40] C. Kothe, S. Y. Shirazi, T. Stenner, D. Medine, C. Boulay, M. I. Grivich, F. Artoni, T. Mullen, A. Delorme, S. Makeig, The lab streaming layer for synchronized multimodal recording, Imaging Neuroscience 3 (2025) IMAG.a.136. URL: https://doi.org/10.1162/IMAG.a.136. doi:10.1162/IMAG.a.136. Open Access.
  41. [41] LSL Developers, Lab Streaming Layer (LSL), https://github.com/sccn/labstreaminglayer, 2025. GitHub repository. Accessed: 2025-12-19.
  42. [42] SYSTRAN SA, Faster Whisper transcription with CTranslate2, https://github.com/SYSTRAN/faster-whisper, 2025. GitHub repository. Accessed: 2025-12-19.
  43. [43] A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, I. Sutskever, Robust Speech Recognition via Large-Scale Weak Supervision, 2022. doi:10.48550/arXiv.2212.04356
  44. [44] D. Wiśniewski, Z. Rostek, A. Nowakowski, FAME-MT dataset: Formality awareness made easy for machine translation purposes, arXiv preprint arXiv:2405.11942 (2024).
  45. [45] M. Nadejde, A. Currey, B. Hsu, X. Niu, M. Federico, G. Dinu, CoCoA-MT: A dataset and benchmark for contrastive controlled MT with application to formality, in: Findings of the Association for Computational Linguistics: NAACL 2022, pp. 616–632 (2022).
  46. [46] H. Asghari, F. Hewett, HIIG at GermEval 2022: Best of both worlds ensemble for automatic text complexity assessment, in: Proceedings of the GermEval 2022 Workshop on Text Complexity Assessment of German Text, pp. 15–20 (2022).
  47. [47] B. Naderi, S. Mohtaj, K. Ensikat, S. Möller, Subjective assessment of text complexity: A dataset for German language, arXiv preprint arXiv:1904.07733 (2019).
  48. [48] D. Wu, T. D. Parsons, S. S. Narayanan, Acoustic feature analysis in speech emotion primitives estimation, in: Interspeech 2010, 2010, pp. 785–788. doi:10.21437/Interspeech.2010-285
  49. [49] ISO 532-1:2017, Acoustics - Methods for calculating loudness. Part 1: Zwicker method, Technical Report, International Organization for Standardization, 2017.
  50. [50] G. F. Coop, MOSQITO, 2025. doi:10.5281/zenodo.10629475
  51. [51] M. Mauch, S. Dixon, pYIN: A fundamental frequency estimator using probabilistic threshold distributions, in: 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014, pp. 659–663. doi:10.1109/ICASSP.2014.6853678
  52. [52] J. Wagner, A. Triantafyllopoulos, H. Wierstorf, M. Schmitt, F. Burkhardt, F. Eyben, B. W. Schuller, Dawn of the transformer era in speech emotion recognition: Closing the valence gap, IEEE Transactions on Pattern Analysis and Machine Intelligence 45 (2023) 10745–10759. doi:10.1109/tpami.2023.3263585
  53. [53] V. Bazarevsky, I. Grishchenko, K. Raveendran, T. L. Zhu, F. Zhang, M. Grundmann, BlazePose: On-device real-time body pose tracking, arXiv abs/2006.10204 (2020).
  54. [54] I. Grishchenko, V. Bazarevsky, A. Zanfir, E. G. Bazavan, M. Zanfir, R. Yee, K. Raveendran, M. Zhdanovich, M. Grundmann, C. Sminchisescu, BlazePose GHUM Holistic: Real-time 3D human landmarks and pose estimation, 2022. arXiv:2206.11678
  55. [55] H. Xu, E. G. Bazavan, A. Zanfir, W. T. Freeman, R. Sukthankar, C. Sminchisescu, GHUM & GHUML: Generative 3D human shape and articulated pose models, in: IEEE-CVPR, 2020, pp. 6183–6192. doi:10.1109/CVPR42600.2020.00622
  56. [56] P. Ekman, An argument for basic emotions, Cognition and Emotion 6 (1992) 169–200. Publisher: Routledge.
  57. [57] V. V. Nikulin, G. Nolte, G. Curio, A novel method for reliable and fast extraction of neuronal EEG/MEG oscillations on the basis of spatio-spectral decomposition, NeuroImage 55 (2011) 1528–1535. doi:10.1016/j.neuroimage.2011.01.057
  58. [58] S. Dähne, F. C. Meinecke, S. Haufe, J. Höhne, M. Tangermann, K.-R. Müller, V. V. Nikulin, SPoC: A novel framework for relating the amplitude of neuronal oscillations to behaviorally relevant parameters, NeuroImage 86 (2014) 111–122. doi:10.1016/j.neuroimage.2013.07.079
  59. [59] S. M. Hofmann, F. Klotzsche, A. Mariola, V. Nikulin, A. Villringer, M. Gaebler, Decoding subjective emotional arousal from EEG during an immersive virtual reality experience, eLife 10 (2021) e64812. doi:10.7554/eLife.64812
  60. [60] B. Nierula, M. T. Lafci, A. Melnik, M. Akgül, F. T. Siewe, S. Bosse, Differential Physiological Responses to Proxemic and Facial Threats in Virtual Avatar Interactions, 2025. doi:10.48550/ARXIV.2508.10586
  61. [61] R. Flewitt, S. Price, T. Korkiakangas, Multimodality: Methodological explorations, Qualitative Research 19 (2018) 3–6. doi:10.1177/1468794118817414
  62. [62] H. Blumer, Symbolic interactionism: Perspective and method, Prentice Hall, Englewood Cliffs, NJ, 1969.
  63. [63] L. D. Keesman, D. Weenink, Feel it coming: Situational turning points in police-civilian encounters, Historical Social Research 47 (2022) 88–110. doi:10.12759/hsr.47.2022.04
  64. [64] H. M. Sunde, How does it end well? An interview study of police officers’ perceptions of de-escalation, Nordic Journal of Studies in Policing 11 (2024) 1–21. URL: https://doi.org/10.18261/njsp.11.1.1. doi:10.18261/njsp.11.1.1
  65. [65] J. Hu, L. Mathur, P. P. Liang, L.-P. Morency, OpenFace 3.0: A lightweight multitask system for comprehensive facial behavior analysis, arXiv preprint arXiv:2506.02891 (2025).
  66. [66] B. Nierula, M. T. Lafci, A. Melnik, E. Gaudinot, N. Karuzin, S. Bosse, Personal space in virtual reality interactions assessed with electroencephalography and skin conductance, in: Neuroscience 2025, San Diego, California, United States, 2025.
  67. [67] M. T. Lafci, B. Nierula, D. Damar, S. Bosse, Too Close for Comfort? Investigating Virtual Professor Distance and Student Learning in VR, in: IEEE SMC, Vienna, 2025.