
arxiv: 2506.13189 · v3 · pith:KCVQ5WC6new · submitted 2025-06-16 · 💻 cs.HC · cs.RO

Gesture First, LLM-Assisted Voice Complement: Exploring Multimodal Robot 'Puppeteer' Teleoperation Via Virtual Counterpart in Augmented Reality

Pith reviewed 2026-05-22 00:39 UTC · model grok-4.3

classification 💻 cs.HC cs.RO
keywords AR teleoperation · multimodal interaction · gesture control · LLM voice commands · robot puppeteer · user study · pick-and-place · design guidelines

The pith

Gesture-only control outperforms combined LLM voice and gesture for reliable AR robot teleoperation in time-critical tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents an AR puppeteer system on the Meta Quest 3 that lets users manipulate a virtual robot counterpart to drive a physical robot, testing LLM-assisted voice commands alongside hand gestures. A within-subject study with 42 participants compared gesture-only control to a sequential voice-plus-gesture setup on a robotic pick-and-place pattern-matching task. Results show gesture-only yields faster, more reliable performance while voice adds flexibility yet introduces latency and recognition errors that raise workload. The authors conclude that multimodality should be applied adaptively according to task urgency and user expertise rather than assumed to improve every interaction.

Core claim

In the AR puppeteer teleoperation setup, gesture-only interaction currently delivers more reliable and efficient control for time-critical robotic pick-and-place tasks than a sequential voice-plus-gesture approach in which voice manages high-level navigation and gestures handle fine manipulation. Voice adds flexibility but brings latency and recognition issues that can increase user workload.

What carries the argument

An AR puppeteer teleoperation system that maps the user's hand gestures and LLM-assisted voice commands onto a virtual robot counterpart, which in turn drives the physical robot. The two modalities are allocated sequential roles.
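
The sequential role allocation described here can be pictured as a small two-phase dispatcher. The sketch below is illustrative only: the class and field names (`Phase`, `Event`, `SequentialDispatcher`), the hand-off rule (a voice command ends the navigation phase, a `"release"` gesture ends the manipulation phase), and the payload strings are all assumptions for the example, not details taken from the paper's implementation.

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class Phase(Enum):
    NAVIGATE = auto()    # voice owns coarse, high-level motion (assumed)
    MANIPULATE = auto()  # gestures own fine positioning and grasping (assumed)


@dataclass
class Event:
    modality: str  # "voice" or "gesture"
    payload: str   # hypothetical command or gesture label


@dataclass
class SequentialDispatcher:
    phase: Phase = Phase.NAVIGATE
    log: list = field(default_factory=list)

    def handle(self, event: Event) -> bool:
        """Accept the event only if its modality owns the current phase."""
        if self.phase is Phase.NAVIGATE and event.modality == "voice":
            self.log.append(("navigate", event.payload))
            # Assumed hand-off: one voice command, then gestures take over.
            self.phase = Phase.MANIPULATE
            return True
        if self.phase is Phase.MANIPULATE and event.modality == "gesture":
            self.log.append(("manipulate", event.payload))
            # Assumed hand-off: releasing the object returns control to voice.
            if event.payload == "release":
                self.phase = Phase.NAVIGATE
            return True
        return False  # modality is not active in this phase, so it is ignored
```

Under these assumptions, every voice command sits on the critical path before any gesture can act, which is one way to see why voice latency and recognition errors would translate directly into longer task times in a sequential design.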

If this is right

  • Gesture-only control is preferable for performance when tasks impose strict time limits.
  • Voice integration increases flexibility but currently raises workload unless latency and recognition improve.
  • Prior robotics expertise changes how users experience the difference between the two conditions.
  • Design guidelines should treat added modalities as adaptive rather than universally beneficial.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Future voice models with reduced latency could reverse the current performance edge of gesture-only.
  • Allowing parallel rather than strictly sequential use of voice and gesture might lower workload in less time-critical tasks.
  • The guidelines may extend to other high-precision teleoperation settings such as remote assembly or medical robotics.

Load-bearing premise

The assumption that the specific sequential allocation of voice to high-level commands and gestures to fine manipulation, together with present-day LLM recognition accuracy, fairly tests the potential of multimodality rather than reflecting implementation limits or task time pressure.

What would settle it

An experiment in which an improved multimodal system with lower voice latency and higher recognition accuracy produces equal or lower workload and equal or faster task times than gesture-only would falsify the preference for gesture-only.

Figures

Figures reproduced from arXiv: 2506.13189 by Bastian Orthmann, Danica Kragic, Jonne Van Haastregt, Michael Welle, Shichen Ji, Yuchong Zhang.

Figure 1: The proposed multimodal robot puppeteer system with …
Figure 2: System realization of the AR puppeteer framework.
Figure 3: Hand gestures designed for interacting with the virtual …
Figure 4: Example scenes of the setup and user study.
Figure 5: An overview of the user study workflow.
TABLE I (truncated in the source): Self-developed measurement metrics in the user study. Performance metrics include Number of Cubes (number of cubes successfully placed in the corresponding colored cells in the target zone) and Unsuccessful Attempts (the robotic gripper attempted to grasp but failed; or the cube was not placed fully inside the designated colo…).
Figure 6: Comparison of performance metrics in the study.
Figure 7: Comparison of UX and usability metrics, including UEQ-S, NASA TLX, customized metrics, and UEQ-S subscales.
Figure 8: Themes and details from qualitative results.
Figure 9: Comparative results of performance between roboticists and non-roboticists.
Figure 10: Comparison of UX and usability metrics between roboticists and non-roboticists.
Original abstract

Robot teleoperation via augmented reality (AR) offers a promising path toward more intuitive human-robot interaction (HRI). We present a head-mounted AR 'puppeteer' system in which users control a physical robot by interacting with its virtual counterpart robot using large language model (LLM)-assisted voice commands and hand-gesture interaction on the Meta Quest 3. In a within-subject user study with 42 participants performing an AR-based robotic pick-and-place pattern-matching task, we empirically compare two interaction conditions: gesture-only (GO) and combined voice+gesture (VG) on performance and user experience (UX). In VG, voice and gesture operate in a sequential role-allocated manner, with voice handling high-level navigation and gesture handling fine manipulation. Our results show that GO currently provides more reliable and efficient control for this time-critical task, while VG introduces additional flexibility but also latency and recognition issues that can increase workload. We additionally analyze how prior robotics expertise differentiates performance and UX across conditions. Based on these findings, we distill a set of design guidelines for AR 'puppeteer' metaphoric robot teleoperation, framing multimodality as an adaptive strategy that must balance efficiency, robustness, and user expertise rather than assuming that additional modalities are universally beneficial.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The manuscript presents an AR 'puppeteer' teleoperation system on the Meta Quest 3 in which users manipulate a virtual robot counterpart to control a physical robot, using hand gestures and LLM-assisted voice commands. A within-subjects study with 42 participants compares a gesture-only (GO) condition against a sequential voice+gesture (VG) condition on a time-critical pick-and-place pattern-matching task. The central empirical finding is that GO currently yields more reliable and efficient performance while VG adds flexibility at the cost of latency and recognition errors that elevate workload; the paper further examines moderation by prior robotics expertise and distills design guidelines that treat multimodality as an adaptive rather than universally superior strategy.

Significance. If the reported performance differences hold under scrutiny, the work offers a timely, practically oriented contribution to human-robot interaction and AR interfaces. The within-subject design with n=42 supplies a reasonable sample for comparative claims, and the explicit framing of VG limitations as implementation-specific rather than inherent to multimodality avoids overgeneralization. The resulting design guidelines constitute a usable takeaway for future AR puppeteer systems.

major comments (2)
  1. [§5] §5 (Results): the manuscript states that GO is more reliable and efficient and that VG increases workload via latency and recognition issues, yet neither the abstract nor the results summary supplies the concrete metrics (task completion time means and SDs, error rates, NASA-TLX or similar workload scores) or the statistical tests (paired t-tests, Wilcoxon signed-rank, exact p-values, effect sizes) used to support these claims. Without these numbers the strength of the central comparative conclusion cannot be fully evaluated.
  2. [§4.2] §4.2 (VG condition description): the sequential role allocation (voice for high-level navigation, gesture for fine manipulation) is presented as the tested multimodal strategy, but the paper provides no ablation or rationale comparing it to simultaneous multimodal use or to alternative voice-trigger timings. Because this allocation directly produces the reported latency penalty, the finding that VG is currently inferior remains tied to one specific implementation choice rather than testing the broader multimodal hypothesis.
minor comments (3)
  1. [Abstract] The abstract should be expanded to include at least one quantitative performance contrast and the latency measurement method so that readers can immediately gauge the magnitude of the reported differences.
  2. [Figures] Figure captions and axis labels in the results figures should explicitly state the units and the exact statistical comparison shown (e.g., 'mean completion time (s) with 95% CI; paired t-test').
  3. [Discussion] A short paragraph in the discussion should address how the observed latency values compare with current commercial LLM voice latencies to help readers judge whether the VG drawbacks are transient or structural.
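
Major comment 1 asks for paired statistics on the GO versus VG comparison. A minimal sketch of how such a summary could be computed for a within-subject design follows. The function name `paired_summary`, the dictionary keys, and the sample numbers in the usage note are illustrative assumptions, not code or values from the paper. The p-value is left out because it needs a t-distribution CDF, which a library routine such as `scipy.stats.ttest_rel` would supply.

```python
import math
import statistics


def paired_summary(go, vg):
    """Summarize per-participant differences (VG minus GO) for one metric.

    `go` and `vg` hold one measurement per participant, in the same order.
    """
    if len(go) != len(vg) or len(go) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")

    diffs = [v - g for g, v in zip(go, vg)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample SD of the differences
    if sd_diff == 0:
        raise ValueError("all differences are identical; t is undefined")

    return {
        "n": n,
        "mean_diff": mean_diff,
        "sd_diff": sd_diff,
        "t": mean_diff / (sd_diff / math.sqrt(n)),  # paired t statistic
        "df": n - 1,
        "d_z": mean_diff / sd_diff,  # standardized mean difference
    }
```

For example, `paired_summary([10, 12, 11, 13], [12, 15, 13, 14])` uses made-up completion times and returns a mean difference of 2 with 3 degrees of freedom. If the differences are clearly non-normal, a Wilcoxon signed-rank test (also named by the referee) would be the usual alternative.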

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback and positive recommendation for minor revision. We agree that greater transparency in reporting quantitative results and additional context for the multimodal design choice will strengthen the manuscript. Our point-by-point responses follow.

Point-by-point responses
  1. Referee: [§5] §5 (Results): the manuscript states that GO is more reliable and efficient and that VG increases workload via latency and recognition issues, yet neither the abstract nor the results summary supplies the concrete metrics (task completion time means and SDs, error rates, NASA-TLX or similar workload scores) or the statistical tests (paired t-tests, Wilcoxon signed-rank, exact p-values, effect sizes) used to support these claims. Without these numbers the strength of the central comparative conclusion cannot be fully evaluated.

    Authors: We agree that the abstract and results summary would benefit from explicit quantitative support. The full results section already contains the requested metrics and tests; we will revise the abstract and add a concise summary table or paragraph in the results section to report mean task completion times with SDs, error rates, NASA-TLX scores, paired statistical tests, exact p-values, and effect sizes.
    revision: yes

  2. Referee: [§4.2] §4.2 (VG condition description): the sequential role allocation (voice for high-level navigation, gesture for fine manipulation) is presented as the tested multimodal strategy, but the paper provides no ablation or rationale comparing it to simultaneous multimodal use or to alternative voice-trigger timings. Because this allocation directly produces the reported latency penalty, the finding that VG is currently inferior remains tied to one specific implementation choice rather than testing the broader multimodal hypothesis.

    Authors: We chose sequential allocation to avoid input conflicts and to let voice handle coarse navigation while gestures manage fine control, consistent with established multimodal HRI practices that reduce cognitive load. We will expand §4.2 to state this rationale explicitly and to note that results apply to this sequential implementation rather than multimodality in general. A full ablation against simultaneous use or different trigger timings would require additional experimental conditions and is beyond the scope of the present study.
    revision: partial

Circularity Check

0 steps flagged

Empirical user study with no derivations or self-referential predictions

This paper reports results from a within-subject user study (n=42) measuring performance and UX in a time-critical pick-and-place task under GO vs. sequential VG conditions. All claims derive directly from participant data and observed metrics rather than equations, fitted parameters, or derivations. No load-bearing self-citations, uniqueness theorems, or ansatzes appear; the design guidelines are distilled post-hoc from the measured outcomes. The work is self-contained against external benchmarks with no circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper is an empirical HCI user study and therefore rests on standard methodological assumptions rather than new mathematical axioms or invented physical entities.

axioms (1)
  • Domain assumption: Within-subject design adequately controls for individual differences in comparing the two interaction conditions.
    Invoked by the choice of study protocol to isolate effects of interaction mode.

