TeamLLM: Exploring the Capabilities of LLMs for Multimodal Group Interaction Prediction

Daniel Khalkhali; Diana Romero; Salma Elmalaki; Xin Gao

arxiv: 2604.08771 · v1 · submitted 2026-04-09 · 💻 cs.HC · cs.ET

Pith reviewed 2026-05-10 16:52 UTC · model grok-4.3

keywords: large language models · group behavior prediction · multimodal sensing · mixed reality · team coordination · conversation prediction

The pith

Large language models can predict group conversation patterns from multimodal sensor data in mixed reality teams by encoding context as natural language, delivering a 3.2× improvement over LSTM baselines on linguistically grounded behaviors.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines whether LLMs can forecast how people coordinate, communicate, and interact as teams during collaborative tasks. It converts streams of sensor data from mixed reality environments into layered text descriptions of individual actions, group structure, and timing, then tests LLMs in zero-shot, few-shot, and fine-tuned modes against statistical baselines. If the approach holds, it opens a path to real-time systems that anticipate team dynamics and support better collaboration without needing specialized models for every sensing task.

Core claim

When multimodal sensor data from 16 groups is turned into hierarchical natural language descriptions of individual profiles, group properties, and temporal context, LLMs deliver a 3.2 times performance gain over LSTM baselines on linguistically grounded behaviors. Supervised fine-tuning reaches 96 percent accuracy on conversation prediction while sustaining sub-35 millisecond latency. The same text-only models cannot capture non-linguistic patterns such as shared or joint attention and degrade sharply in simulation mode from cascading context errors.

What carries the argument

Hierarchical natural-language encoding of individual behavioral profiles, group structural properties, and temporal activity context, applied through zero-shot, few-shot, and supervised fine-tuning paradigms of LLMs.
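
To make the encoding idea concrete, here is a minimal sketch of how three context layers might be rendered as text. It is an illustration under stated assumptions, not the authors' prompt format: the `Participant` fields, the `encode_context` function, the density metric, and the wording of each line are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Participant:
    # All fields are hypothetical examples of per-person features.
    pid: str
    speaking_share: float  # fraction of the window spent speaking
    gaze_target: str       # target looked at most during the window


def encode_context(participants: List[Participant], density: float,
                   task_phase: str, window_s: int) -> str:
    """Render individual, group, and temporal layers as plain text."""
    lines = ["Individual profiles:"]
    for p in participants:
        lines.append(
            f"- {p.pid} spoke {p.speaking_share:.0%} of the window "
            f"and looked mostly at {p.gaze_target}."
        )
    lines.append("Group properties:")
    lines.append(f"- Interaction network density: {density:.2f}.")
    lines.append("Temporal context:")
    lines.append(f"- Task phase: {task_phase}; window length: {window_s} s.")
    return "\n".join(lines)


# Invented values, for illustration only.
group = [Participant("P1", 0.5, "P2"), Participant("P2", 0.25, "shared object")]
prompt = encode_context(group, density=0.67, task_phase="sorting", window_s=30)
print(prompt)
```

Keeping each layer under its own label means a layer can be dropped or reordered in an ablation without touching the others.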

If this is right

Text-based LLMs succeed on turn-taking and conversation prediction because these behaviors map to linguistic patterns.
Fine-tuned LLMs can deliver real-time, low-latency predictions suitable for live team-support applications.
Simulation of ongoing group interactions using these models is brittle because small context errors compound rapidly.
LLM performance on this task shows little dependence on the exact choice of few-shot examples.
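
The brittleness of simulation mode can be illustrated with a toy rollout. This is a sketch, not the paper's setup: the lookup-table `predict` function stands in for a model, the speaker sequence is invented, and the numbers it produces are unrelated to the reported 83% degradation. It only shows the mechanism, where one wrong prediction fed back as context corrupts every later step.

```python
# Toy stand-in for a model: always predicts the next speaker in a fixed cycle.
NEXT_SPEAKER = {"A": "B", "B": "C", "C": "A"}


def predict(previous_speaker: str) -> str:
    return NEXT_SPEAKER[previous_speaker]


def accuracy(truth: list, simulate: bool) -> float:
    """Score next-speaker predictions over a sequence.

    simulate=False: context is refreshed with ground truth at every step.
    simulate=True:  the model's own guess becomes the next context.
    """
    context = truth[0]
    correct = 0
    for actual in truth[1:]:
        guess = predict(context)
        correct += guess == actual
        context = guess if simulate else actual
    return correct / (len(truth) - 1)


# Invented sequence with one irregular turn (A speaks twice in a row).
truth = ["A", "B", "C", "A", "A", "B", "C", "A", "B", "C"]
prediction_mode = accuracy(truth, simulate=False)
simulation_mode = accuracy(truth, simulate=True)
print(f"prediction mode: {prediction_mode:.2f}, simulation mode: {simulation_mode:.2f}")
```

With fresh ground truth the single irregular turn costs one error; in simulation mode the same turn shifts the context and every later prediction misses.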

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Models that process both language and direct spatial-visual inputs will be needed to handle the full range of group coordination signals.
The identified performance boundaries can guide selection of LLMs versus other methods for sensing team dynamics in IoT and cyber-physical systems.
Applying the same encoding method to non-mixed-reality collaborative settings would reveal whether the language-based approach generalizes.

Load-bearing premise

Converting multimodal sensor streams into hierarchical natural-language descriptions preserves enough information for LLMs to predict non-linguistic group behaviors such as shared or joint attention.

What would settle it

A direct test in which the same text-encoded descriptions are fed to LLMs and accuracy on shared or joint attention tasks remains no better than chance while conversation prediction stays high.
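
One way to operationalize "no better than chance" is a one-sided exact binomial test. The sketch below assumes a balanced binary task (chance = 0.5) and uses invented counts; neither comes from the paper, and the appropriate chance level depends on the number of classes and their balance.

```python
from math import comb


def p_value_above_chance(correct: int, total: int, chance: float) -> float:
    """P(X >= correct) for X ~ Binomial(total, chance)."""
    return sum(
        comb(total, k) * chance**k * (1 - chance) ** (total - k)
        for k in range(correct, total + 1)
    )


# Hypothetical counts, for illustration only.
joint_attention_p = p_value_above_chance(correct=52, total=100, chance=0.5)
conversation_p = p_value_above_chance(correct=96, total=100, chance=0.5)
print(joint_attention_p > 0.05)   # cannot distinguish from chance at the 5% level
print(conversation_p < 0.05)      # clearly above chance
```

A pattern like this, with one task indistinguishable from chance and the other far above it, is the outcome the test described above would be looking for.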

Figures

Figures reproduced from arXiv: 2604.08771 by Daniel Khalkhali, Diana Romero, Salma Elmalaki, Xin Gao.

**Figure 1.** Methodology. Multimodal MR sensor data (gaze, audio, location, task state) are processed into sociograms, then encoded as hierarchical natural ...

**Figure 2.** Prompt structure showing participant profiles, group metrics, temporal ...

Original abstract

Predicting group behavior, how individuals coordinate, communicate, and interact during collaborative tasks, is essential for designing systems that can support team performance through real-time prediction and realistic simulation of collaborative scenarios. Large Language Models (LLMs) have shown promise for processing sensor data for human-activity recognition (HAR), yet their capabilities for team dynamics or group-level multimodal sensing remain unexplored. This paper investigates whether LLMs can predict group coordination patterns from multimodal sensor data in collaborative Mixed Reality (MR) environments. We encode hierarchical context (individual behavioral profiles, group structural properties, and temporal activity context) as natural language and evaluate three LLM adaptation paradigms (zero-shot, few-shot, and supervised fine-tuning) against statistical baselines. Our evaluation on 16 groups (64 participants, ~25 hours of sensor data) reveals that LLMs achieve 3.2× improvement over LSTM baselines for linguistically-grounded behaviors, with fine-tuning reaching 96% accuracy for conversation prediction while maintaining sub-35ms latency. Beyond performance gains, we characterize the boundaries of text-based LLMs for multimodal sensing conversation prediction succeeds because turn-taking maps to linguistic patterns, while shared or joint attention may require spatial and visual reasoning that text only LLMs cannot capture. We further identify simulation mode brittleness (83% degradation from cascading context errors) and minimal few-shot sensitivity to example selection strategy. These findings establish guidelines when LLMs are appropriate for CPS/IoT sensing for team dynamics and inform the design of future multimodal foundation models.
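
The difference between the zero-shot and few-shot paradigms named in the abstract can be sketched as a prompt-assembly step. Everything here is an assumption made for illustration: the `build_prompt` function, the label set, the instruction wording, and the example texts are hypothetical, and supervised fine-tuning (which would train on prompt and answer pairs) is not shown.

```python
from typing import List, Tuple

LABELS = ("conversation", "no conversation")  # hypothetical label set


def build_prompt(context: str, examples: List[Tuple[str, str]], k: int = 0) -> str:
    """Assemble a prompt with k labeled examples; k=0 gives a zero-shot prompt."""
    parts = [
        "Predict the next group behavior. "
        f"Answer with one of: {', '.join(LABELS)}."
    ]
    for i, (ex_context, ex_label) in enumerate(examples[:k], start=1):
        parts.append(f"Example {i}:\n{ex_context}\nAnswer: {ex_label}")
    parts.append(f"Current window:\n{context}\nAnswer:")
    return "\n\n".join(parts)


# Invented example pool, for illustration only.
examples = [
    ("P1 and P2 alternate speaking.", "conversation"),
    ("All participants are silent.", "no conversation"),
]
zero_shot = build_prompt("P3 has just stopped speaking.", examples, k=0)
few_shot = build_prompt("P3 has just stopped speaking.", examples, k=2)
print(few_shot)
```

Because the example pool is passed in as a list, different selection strategies (random, most similar, most recent) can be compared by reordering `examples` before the call.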

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper shows LLMs can beat LSTM baselines on conversation prediction in small-group MR data when sensor streams are turned into hierarchical text, but they fail on spatial behaviors and degrade sharply in simulation mode.

The core finding is that LLMs reach 96% accuracy and a 3.2× gain over LSTMs for turn-taking and conversation patterns once individual profiles, group structure, and timing are written out as natural language. On 16 groups and roughly 25 hours of data, the fine-tuned models also keep inference under 35 ms. That is the usable result: text-based models handle linguistically grounded coordination without needing raw sensor fusion inside the model itself.

The authors are clear that the same approach collapses for joint attention or shared gaze, which makes sense because those behaviors lack direct linguistic correlates. They also flag the 83% degradation when the model has to simulate forward without fresh ground truth, which is a practical warning rather than a hidden flaw.

The work is new in the narrow sense that prior LLM-for-HAR papers stayed at the individual level; extending the same encoding trick to group structure and testing the boundaries is a straightforward but useful step. The evaluation stays within its claims and does not overstate what text-only models can do.

The main soft spots are the modest scale (16 groups) and the choice of LSTM as the main baseline; stronger modern sequence models or direct multimodal baselines would tighten the comparison. The conversion step from sensors to text is acknowledged as lossy for spatial tasks, so the paper does not pretend the method is universal.

This is the kind of paper that belongs in a reading group on applied LLM sensing or collaborative systems. It gives concrete numbers and honest limits rather than broad claims. I would send it to peer review because the empirical scope is clear, the failures are reported, and the latency and accuracy figures are actionable for anyone building team-support tools in MR or IoT settings.

Referee Report

0 major / 3 minor

Summary. The paper investigates whether LLMs can predict group coordination patterns from multimodal sensor data in collaborative Mixed Reality environments. It encodes hierarchical context (individual profiles, group properties, temporal activity) as natural language and compares zero-shot, few-shot, and supervised fine-tuning paradigms against LSTM baselines. On data from 16 groups (64 participants, ~25 hours), LLMs show a 3.2× improvement over baselines for linguistically-grounded behaviors such as conversation prediction (reaching 96% accuracy with fine-tuning and sub-35ms latency), while explicitly reporting failure on spatial tasks like joint attention and 83% degradation in simulation mode due to cascading errors. The work scopes claims to linguistic behaviors where turn-taking aligns with text patterns and provides guidelines for LLM use in CPS/IoT team dynamics sensing.

Significance. If the empirical results hold under the reported boundaries, the work is significant for establishing when text-based LLMs are suitable for multimodal group behavior prediction. It provides a balanced assessment by highlighting successes in linguistically-grounded tasks alongside clear failures in spatial reasoning and simulation, which can guide design of future multimodal foundation models. Strengths include the real-world MR dataset size, explicit negative results, and scoping of positive claims to avoid overgeneralization.

minor comments (3)

Abstract: The sentence beginning 'Beyond performance gains, we characterize the boundaries of text-based LLMs for multimodal sensing conversation prediction succeeds because...' is missing punctuation and reads as a run-on; a period after 'sensing' would improve readability.
The manuscript would benefit from a summary table (perhaps in §4 or §5) aggregating accuracy, latency, and degradation metrics across all behaviors, paradigms, and baselines for quick comparison.
While the abstract notes the 16-group dataset, additional minor clarification on participant demographics, sensor calibration procedures, or inter-rater reliability for any ground-truth labels would strengthen reproducibility without altering the central claims.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive and balanced assessment of our manuscript. We appreciate the recognition of the work's significance in scoping LLM capabilities for multimodal group interaction prediction, the value of the real-world MR dataset, and the explicit reporting of limitations such as failures on spatial tasks and simulation-mode degradation. We are pleased with the recommendation for minor revision and will incorporate any specific editorial suggestions in the revised version.

Circularity Check

0 steps flagged

No significant circularity; empirical evaluation is self-contained

The paper reports an empirical comparison of LLM adaptation paradigms (zero-shot, few-shot, fine-tuning) against LSTM and statistical baselines on a fixed dataset of 16 groups and ~25 hours of multimodal sensor data. Core claims rest on measured accuracies (e.g., 96% for conversation prediction) and relative improvements (3.2×), obtained by encoding sensor streams as hierarchical natural-language text and running standard inference or training loops. No equations, fitted parameters renamed as predictions, uniqueness theorems, or self-citation chains appear in the provided text; the evaluation pipeline is externally falsifiable against the collected data and does not reduce any result to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that text encodings of sensor data are sufficient for LLM-based prediction of group coordination; no free parameters or invented entities are mentioned in the abstract.

axioms (1)

Domain assumption: Hierarchical context (individual profiles, group structure, temporal activity) can be faithfully encoded as natural language without loss of predictive information for linguistic behaviors.
Invoked when the paper states that encoding context as natural language enables LLM prediction of conversation patterns.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Passage: "We encode hierarchical context (individual behavioral profiles, group structural properties, and temporal activity context) as natural language and evaluate three LLM adaptation paradigms (zero-shot, few-shot, and supervised fine-tuning) against statistical baselines... LLMs achieve 3.2× improvement over LSTM baselines for linguistically-grounded behaviors"

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Passage: "conversation prediction succeeds because turn-taking maps to linguistic patterns, while shared or joint attention may require spatial and visual reasoning that text only LLMs cannot capture"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
