SemanticScanpath: Combining Gaze and Speech for Situated Human-Robot Interaction Using LLMs

Anna Belardinelli; Carlos Balaguer; Elisabeth Menendez; Michael Gienger; Santiago Mart\'inez

arxiv: 2503.16548 · v2 · submitted 2025-03-19 · 💻 cs.HC · cs.RO

SemanticScanpath: Combining Gaze and Speech for Situated Human-Robot Interaction Using LLMs

Elisabeth Menendez , Michael Gienger , Santiago Mart\'inez , Carlos Balaguer , Anna Belardinelli This is my paper

Pith reviewed 2026-05-22 23:59 UTC · model grok-4.3

classification 💻 cs.HC cs.RO

keywords gaze scanpathhuman-robot interactionlarge language modelssituated awarenessreferential gazespeech groundingmultimodal input

0 comments

The pith

Converting gaze scanpaths to text lets LLMs resolve ambiguous spoken requests for robots in physical scenes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to establish that a semantic text description of a user's eye-movement scanpath, when supplied to a large language model together with spoken words, enables the model to ground underspecified requests in the current physical environment. A sympathetic reader would care because this removes the need for perfectly explicit speech or separate gaze classifiers, letting the same model handle both verbal and nonverbal input. The method shows the model can disregard irrelevant glances and still identify the intended objects across varied tasks. Validation occurs in two scenarios with multiple tasks, where the combined input outperforms control conditions that lack the gaze translation. The approach is demonstrated end-to-end on a physical robot that executes the interpreted request.

Core claim

The paper claims that large language models can reason about referential gaze once the raw scanpath is rendered as text; the resulting combined input allows the model to relate ambiguous speech to the scene and user intent while ignoring spurious glances, yielding higher generality and accuracy than baselines across tasks and scenarios, and permitting direct translation into robot actions.

What carries the argument

The text-based semantic translation of the user's gaze scanpath, which turns eye-movement sequences into a readable description that an LLM processes jointly with speech.

If this is right

Robots can interpret requests that mention objects only by gaze rather than by name.
The same LLM handles both speech and gaze without separate specialized modules for each modality.
Performance holds across different physical layouts and task types without per-task retraining.
The system produces executable robot commands directly from the joint interpretation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same translation technique could be tested with other transient signals such as pointing gestures or head orientation.
If the translation remains stable under head movement or changing lighting, the method would apply to mobile robots in less controlled spaces.
Extending the approach to multi-turn dialogues might allow the model to maintain and update scene references across exchanges.
The method suggests that other embodied AI systems could benefit from converting sensor streams into text for unified LLM reasoning.

Load-bearing premise

The semantic translation step converts the raw gaze data into text without distorting or losing the user's actual referential intent.

What would settle it

A controlled trial in which the LLM, given the translated scanpath and speech, repeatedly selects irrelevant objects that appeared only in spurious glances, or shows no accuracy gain over speech-only baselines in new tasks.

Figures

Figures reproduced from arXiv: 2503.16548 by Anna Belardinelli, Carlos Balaguer, Elisabeth Menendez, Michael Gienger, Santiago Mart\'inez.

**Figure 2.** Figure 2: Visualization of gaze history and speech input over [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗

**Figure 3.** Figure 3: Top row: Accuracy with respect to the ground truth inference in the breakfast (a) and drink (b) scenarios. When the [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗

**Figure 4.** Figure 4: Robot demonstration illustrating the LLM’s ability to disambiguate user requests by integrating gaze history and [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗

read the original abstract

Large Language Models (LLMs) have substantially improved the conversational capabilities of social robots. Nevertheless, for an intuitive and fluent human-robot interaction, robots should be able to ground the conversation by relating ambiguous or underspecified spoken utterances to the current physical situation and to the intents expressed nonverbally by the user, such as through referential gaze. Here, we propose a representation that integrates speech and gaze to enable LLMs to achieve higher situated awareness and correctly resolve ambiguous requests. Our approach relies on a text-based semantic translation of the scanpath produced by the user, along with the verbal requests. It demonstrates LLMs' capabilities to reason about gaze behavior, robustly ignoring spurious glances or irrelevant objects. We validate the system across multiple tasks and two scenarios, showing its superior generality and accuracy compared to control conditions. We demonstrate an implementation on a robotic platform, closing the loop from request interpretation to execution.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper turns gaze scanpaths into text for LLMs to disambiguate spoken requests in HRI, but the abstract and stress-test note leave the translation step and all quantitative results undescribed.

read the letter

The core idea here is a text-based encoding of a user's gaze scanpath that gets passed to an LLM together with speech, so the model can resolve ambiguous commands by reasoning about what the person was looking at. That specific combination for situated HRI looks new relative to the gaze and grounding work they cite, and the robot implementation that closes the loop from interpretation to action is a concrete engineering step worth noting. They claim the LLM can ignore spurious glances and that the system beats control conditions across tasks and scenarios, which would be useful if it holds up. The main soft spot is exactly the one the stress-test flags: the abstract gives no description of how the raw scanpath becomes semantic text, no metrics, no error analysis, and no baseline details. Without that, you cannot tell whether the robustness comes from LLM reasoning or from whatever rules or tuning went into the translation step. The paper is an engineering integration rather than a formal derivation, so the absence of those details makes the central claim rest on unshown evidence. This is aimed at HRI researchers who need practical multimodal grounding; the idea is straightforward enough that a serious referee could evaluate the full evaluation and the translation procedure. I would send it to review rather than desk reject, mainly to see the numbers and the exact pipeline.

Referee Report

2 major / 0 minor

Summary. The manuscript introduces SemanticScanpath, a system that converts user gaze scanpaths into text-based semantic representations combined with spoken requests, enabling LLMs to resolve ambiguous or underspecified utterances in situated human-robot interaction. It claims this approach demonstrates LLMs' ability to reason about gaze behavior by robustly ignoring spurious glances or irrelevant objects, with validation showing superior generality and accuracy over control conditions across multiple tasks in two scenarios, plus a closed-loop robotic implementation.

Significance. If the quantitative results and implementation details hold, the work could advance situated HRI by providing a general, LLM-based method for multimodal grounding of speech with referential gaze without requiring task-specific models. The engineering integration of gaze-to-text translation is a practical contribution, though the absence of reported metrics, error analysis, or baseline details in the abstract makes the claimed robustness difficult to assess at present.

major comments (2)

[Abstract] Abstract: the central claim of 'superior generality and accuracy' and 'robustly ignoring spurious glances' is asserted without any quantitative metrics, error rates, statistical comparisons, or details on how baselines were implemented or controlled, leaving the evidence for LLM reasoning about gaze unshown and the soundness of the validation unclear.
[Approach description] Approach description (abstract and method): the 'text-based semantic translation of the scanpath' is presented only at a high level with no specification of the procedure, any filtering/summarization rules, heuristics, or potential scenario-specific tuning; without these details it is impossible to determine whether observed performance and robustness should be attributed to LLM reasoning or to the translation pipeline itself.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to incorporate additional details and metrics as suggested.

read point-by-point responses

Referee: [Abstract] Abstract: the central claim of 'superior generality and accuracy' and 'robustly ignoring spurious glances' is asserted without any quantitative metrics, error rates, statistical comparisons, or details on how baselines were implemented or controlled, leaving the evidence for LLM reasoning about gaze unshown and the soundness of the validation unclear.

Authors: We agree that the abstract would be strengthened by including key quantitative results. In the revised version, we will update the abstract to summarize specific metrics from the evaluation (such as accuracy and generality comparisons across tasks and scenarios) along with brief notes on baseline controls. The full manuscript already contains these results and statistical details in the validation sections, but adding them to the abstract will make the evidence for the LLM's gaze reasoning more immediately apparent. revision: yes
Referee: [Approach description] Approach description (abstract and method): the 'text-based semantic translation of the scanpath' is presented only at a high level with no specification of the procedure, any filtering/summarization rules, heuristics, or potential scenario-specific tuning; without these details it is impossible to determine whether observed performance and robustness should be attributed to LLM reasoning or to the translation pipeline itself.

Authors: We acknowledge that the current description of the scanpath-to-text translation is high-level. We will expand the method section in the revision to provide a detailed specification of the translation procedure, including the exact steps for generating the semantic text representation, any filtering or summarization rules applied to the scanpath, and clarification on the absence of scenario-specific tuning. This will help readers distinguish the contributions of the LLM reasoning from the preprocessing steps. revision: yes

Circularity Check

0 steps flagged

No circularity: engineering system description with no derivations, fitted parameters, or self-referential reductions

full rationale

The paper presents an applied integration of gaze scanpath translation into text for LLM-based HRI, with no equations, parameter fitting, uniqueness theorems, or derivation chains. The central claim rests on empirical validation across tasks rather than any mathematical reduction to inputs. No self-citations are invoked as load-bearing premises, and the semantic translation step is described as a representation choice without being redefined in terms of the LLM output it enables. This matches the default expectation of a non-circular engineering paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are stated in the abstract; the central claim rests on the unexamined assumption that LLM reasoning over the generated text will be reliable and that the scanpath-to-text step faithfully encodes intent.

pith-pipeline@v0.9.0 · 5700 in / 1113 out tokens · 16882 ms · 2026-05-22T23:59:33.544798+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

46 extracted references · 46 canonical work pages

[1]

Service robots in the healthcare sector,

J. Holland, L. Kingston, C. McCarthy et al. , “Service robots in the healthcare sector,” Robotics, vol. 10, no. 1, p. 47, 2021

work page 2021
[2]

Adam: a robotic compan- ion for enhanced quality of life in aging populations,

A. Mora, A. Prados, A. Mendez et al. , “Adam: a robotic compan- ion for enhanced quality of life in aging populations,” Frontiers in Neurorobotics, vol. 18, p. 1337608, 2024

work page 2024
[3]

Planning with verbal communication for human-robot collaboration,

S. Nikolaidis, M. Kwon, J. Forlizzi et al. , “Planning with verbal communication for human-robot collaboration,” ACM Transactions on Human-Robot Interaction (THRI) , vol. 7, no. 3, pp. 1–21, 2018

work page 2018
[4]

Spoken language inter- action with robots: Recommendations for future research,

M. Marge, C. Espy-Wilson, N. G. Ward et al., “Spoken language inter- action with robots: Recommendations for future research,” Computer Speech & Language , vol. 71, p. 101255, 2022

work page 2022
[5]

A survey on dialogue management in human-robot interaction,

M. M. Reimann, F. A. Kunneman, C. Oertel et al. , “A survey on dialogue management in human-robot interaction,” ACM Transactions on Human-Robot Interaction , 2024

work page 2024
[6]

Embodied agent interface: Bench- marking llms for embodied decision making,

M. Li, S. Zhao, Q. Wang et al. , “Embodied agent interface: Bench- marking llms for embodied decision making,” Advances in Neural Information Processing Systems , vol. 37, pp. 100 428–100 534, 2025

work page 2025
[7]

CoPAL: corrective planning of robot actions with large language models,

F. Joublin, A. Ceravola, P. Smirnov et al., “CoPAL: corrective planning of robot actions with large language models,” in 2024 IEEE Interna- tional Conference on Robotics and Automation (ICRA) . IEEE, 2024, pp. 8664–8670

work page 2024
[8]

Robots that use lan- guage,

S. Tellex, N. Gopalan, H. Kress-Gazit et al. , “Robots that use lan- guage,”Annual Review of Control, Robotics, and Autonomous Systems, vol. 3, no. 1, pp. 25–55, 2020

work page 2020
[9]

Vision-language model-driven scene understanding and robotic object manipulation,

S. Liu, J. Zhang, R. X. Gao et al. , “Vision-language model-driven scene understanding and robotic object manipulation,” in 2024 IEEE 20th International Conference on Automation Science and Engineering (CASE). IEEE, 2024, pp. 21–26

work page 2024
[10]

Speakers’ eye gaze disambiguates referring expressions early during face-to-face conversation,

J. E. Hanna and S. E. Brennan, “Speakers’ eye gaze disambiguates referring expressions early during face-to-face conversation,” Journal of Memory and Language , vol. 57, no. 4, pp. 596–615, 2007

work page 2007
[11]

The utility of gaze in spoken human- robot interaction,

M. Staudte and M. Crocker, “The utility of gaze in spoken human- robot interaction,” in Proceedings of Workshop on Metrics for Human- Robot Interaction 2008 , 2008, pp. 53–59

work page 2008
[12]

A constructive model for the development of joint attention,

Y . Nagai, K. Hosoda, A. Morita et al., “A constructive model for the development of joint attention,” Connection Science , vol. 15, no. 4, pp. 211–229, 2003

work page 2003
[13]

Social eye gaze in human-robot interaction: a review,

H. Admoni and B. Scassellati, “Social eye gaze in human-robot interaction: a review,” Journal of Human-Robot Interaction , vol. 6, no. 1, pp. 25–63, 2017

work page 2017
[14]

Gaze-based intention estimation: principles, method- ologies, and applications in hri,

A. Belardinelli, “Gaze-based intention estimation: principles, method- ologies, and applications in hri,” ACM Transactions on Human-Robot Interaction, vol. 13, no. 3, pp. 1–30, 2024

work page 2024
[15]

Integrating egocentric and robotic vision for object identification using siamese networks and superquadric estimations in partial occlusion scenarios,

E. Menendez, S. Mart ´ınez, F. D ´ıaz-de Mar´ıa et al. , “Integrating egocentric and robotic vision for object identification using siamese networks and superquadric estimations in partial occlusion scenarios,” Biomimetics, vol. 9, no. 2, p. 100, 2024

work page 2024
[16]

Situated open world ref- erence resolution for human-robot dialogue,

T. Williams, S. Acharya, S. Schreitter et al., “Situated open world ref- erence resolution for human-robot dialogue,” in 2016 11th ACM/IEEE International Conference on Human-Robot Interaction (HRI) . IEEE, 2016, pp. 311–318

work page 2016
[17]

To help or not to help: Llm- based attentive support for human-robot group interactions,

D. Tanneberg, F. Ocker, S. Hasler et al., “To help or not to help: Llm- based attentive support for human-robot group interactions,” in 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2024, pp. 9130–9137

work page 2024
[18]

Lami: Large language mod- els for multi-modal human-robot interaction,

C. Wang, S. Hasler, D. Tanneberg et al., “Lami: Large language mod- els for multi-modal human-robot interaction,” in Extended Abstracts of the CHI Conference on Human Factors in Computing Systems , 2024, pp. 1–10

work page 2024
[19]

Situated dialogue pro- cessing for human-robot interaction,

G.-J. M. Kruijff, P. Lison, T. Benjamin et al., “Situated dialogue pro- cessing for human-robot interaction,” in Cognitive systems. Springer, 2010, pp. 311–364

work page 2010
[20]

Going beyond literal command-based instructions: Extending robotic natural language in- teraction capabilities,

T. Williams, G. Briggs, B. Oosterveld et al. , “Going beyond literal command-based instructions: Extending robotic natural language in- teraction capabilities,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 29, no. 1, 2015

work page 2015
[21]

Collaborative effort towards common ground in situated human-robot dialogue,

J. Y . Chai, L. She, R. Fang et al. , “Collaborative effort towards common ground in situated human-robot dialogue,” in 2014 9th ACM/IEEE International Conference on Human-Robot Interaction (HRI), 2014, pp. 33–40

work page 2014
[22]

Semantically-driven disambigua- tion for human-robot interaction,

F. I. Dogan, W. Liu, I. Leite et al., “Semantically-driven disambigua- tion for human-robot interaction,” arXiv preprint arXiv:2409.17004 , 2024

work page arXiv 2024
[23]

HandMe That: Human-robot communication in physical and social environments,

Y . Wan, J. Mao, and J. Tenenbaum, “HandMe That: Human-robot communication in physical and social environments,” Advances in Neural Information Processing Systems , vol. 35, pp. 12 014–12 026, 2022

work page 2022
[24]

The reliability of non-verbal cues for situated reference resolution and their interplay with language: implications for human robot interaction,

S. Gross, B. Krenn, and M. Scheutz, “The reliability of non-verbal cues for situated reference resolution and their interplay with language: implications for human robot interaction,” in Proceedings of the 19th ACM international conference on multimodal interaction , 2017, pp. 189–196

work page 2017
[25]

Language, common sense, and the Wino- grad schema challenge,

J. Browning and Y . LeCun, “Language, common sense, and the Wino- grad schema challenge,” Artificial Intelligence, vol. 325, p. 104031, 2023

work page 2023
[26]

V oila-a: Aligning vision-language models with user’s gaze attention,

K. Yan, Z. Wang, L. Ji et al. , “V oila-a: Aligning vision-language models with user’s gaze attention,” Advances in Neural Information Processing Systems, vol. 37, pp. 1890–1918, 2025

work page 1918
[27]

GazePointAR: A context-aware multimodal voice assistant for pronoun disambiguation in wearable augmented reality,

J. Lee, J. Wang, E. Brown et al. , “GazePointAR: A context-aware multimodal voice assistant for pronoun disambiguation in wearable augmented reality,” in Proceedings of the CHI Conference on Human Factors in Computing Systems , 2024, pp. 1–20

work page 2024
[28]

GazeGPT: Augmenting human capabilities using gaze-contingent contextual ai for smart eyewear,

R. Konrad, N. Padmanaban, J. G. Buckmaster et al. , “GazeGPT: Augmenting human capabilities using gaze-contingent contextual ai for smart eyewear,” arXiv preprint arXiv:2401.17217 , 2024

work page arXiv 2024
[29]

Specifying target objects in robot teleoperation using speech and natural eye gaze,

Y .-C. Chang, N. Gandi, K. Shin et al. , “Specifying target objects in robot teleoperation using speech and natural eye gaze,” in 2023 IEEE-RAS 22nd International Conference on Humanoid Robots (Hu- manoids). IEEE, 2023, pp. 1–7

work page 2023
[30]

Understanding large-language model (llm)-powered human-robot interaction,

C. Y . Kim, C. P. Lee, and B. Mutlu, “Understanding large-language model (llm)-powered human-robot interaction,” in Proceedings of the 2024 ACM/IEEE International Conference on Human-Robot Interac- tion, 2024, pp. 371–380

work page 2024
[31]

Speaking and listening with the eyes: Gaze signaling during dyadic interactions,

S. Ho, T. Foulsham, and A. Kingstone, “Speaking and listening with the eyes: Gaze signaling during dyadic interactions,” PloS one, vol. 10, no. 8, p. e0136905, 2015

work page 2015
[32]

Looking coordinated: Bidi- rectional gaze mechanisms for collaborative interaction with virtual characters,

S. Andrist, M. Gleicher, and B. Mutlu, “Looking coordinated: Bidi- rectional gaze mechanisms for collaborative interaction with virtual characters,” in Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems , ser. CHI ’17. New York, NY , USA: Association for Computing Machinery, 2017, p. 2571–2582

work page 2017
[33]

Head pose as a proxy for gaze in virtual reality,

P. Higgins, R. Barron, and C. Matuszek, “Head pose as a proxy for gaze in virtual reality,” in 5th international workshop on virtual, augmented, and mixed reality for HRI , 2022

work page 2022
[34]

Object-aware gaze target detection,

F. Tonini, N. Dall’Asen, C. Beyan et al. , “Object-aware gaze target detection,” in Proceedings of the IEEE/CVF International Conference on Computer Vision , 2023, pp. 21 860–21 869

work page 2023
[35]

A pipeline for estimating human attention toward objects with on-board cameras on the icub humanoid robot,

S. Hanifi, E. Maiettini, M. Lombardi et al., “A pipeline for estimating human attention toward objects with on-board cameras on the icub humanoid robot,” Frontiers in Robotics and AI , vol. 11, p. 1346714, 2024

work page 2024
[36]

A review of machine learning in scanpath analysis for passive gaze-based interaction,

A. Mohamed Selim, M. Barz, O. S. Bhatti et al., “A review of machine learning in scanpath analysis for passive gaze-based interaction,” Frontiers in Artificial Intelligence, vol. 7, p. 1391745, 2024

work page 2024
[37]

Can you pass that tool?: Implications of indirect speech in physical human-robot collaboration,

Y . Zhang, T. S. Ratnayake, C. Sew et al. , “Can you pass that tool?: Implications of indirect speech in physical human-robot collaboration,” arXiv preprint arXiv:2502.11720 , 2025. APPENDIX See Listing 1 for the system prompt and Table III for the available tools. Listing 1: System prompt. You are a friendly and attentive service agent. You control a phy...

work page arXiv 2025
[38]

Always start gathering all available information related to the request from the scene and the input

work page
[39]

Use gaze to clarify speech, when requests are ambiguous

Always focus on understanding the user’s intent based on context, speech input, and gaze history. Use gaze to clarify speech, when requests are ambiguous. Use speech to clarify gaze, when requests are ambiguous

work page
[40]

Be concise and clear

Provide a reason for every response to user requests using the ’reasoning’ function to explain decisions. Be concise and clear

work page
[41]

Speak out loud using the ’speak’ function to communicate clearly and concisely with the user

work page
[42]

If you are not sure about the user’s intent, ask for clarification

work page
[43]

REMEMBER YOUR RULES!! TIPS FOR INTERPRETING GAZE:

Provide the ’required_objects’ for every user request. REMEMBER YOUR RULES!! TIPS FOR INTERPRETING GAZE:

work page
[44]

Referred objects are usually gazed ahead of utterance, but also right before looking at you

work page
[45]

Intentionally referred objects are usually looked at longer and more frequently

work page
[46]

TABLE III: Overview of Available Tools and Their Arguments Tool Description Arguments Query Tools query objects Query all objects that are available in the scene

Spurious fixations are usually short and mixed with closer objects. TABLE III: Overview of Available Tools and Their Arguments Tool Description Arguments Query Tools query objects Query all objects that are available in the scene. You can see all these objects. - Diagnostic Tools reasoning You provide a reason for the action you are about to take. - requi...

work page

[1] [1]

Service robots in the healthcare sector,

J. Holland, L. Kingston, C. McCarthy et al. , “Service robots in the healthcare sector,” Robotics, vol. 10, no. 1, p. 47, 2021

work page 2021

[2] [2]

Adam: a robotic compan- ion for enhanced quality of life in aging populations,

A. Mora, A. Prados, A. Mendez et al. , “Adam: a robotic compan- ion for enhanced quality of life in aging populations,” Frontiers in Neurorobotics, vol. 18, p. 1337608, 2024

work page 2024

[3] [3]

Planning with verbal communication for human-robot collaboration,

S. Nikolaidis, M. Kwon, J. Forlizzi et al. , “Planning with verbal communication for human-robot collaboration,” ACM Transactions on Human-Robot Interaction (THRI) , vol. 7, no. 3, pp. 1–21, 2018

work page 2018

[4] [4]

Spoken language inter- action with robots: Recommendations for future research,

M. Marge, C. Espy-Wilson, N. G. Ward et al., “Spoken language inter- action with robots: Recommendations for future research,” Computer Speech & Language , vol. 71, p. 101255, 2022

work page 2022

[5] [5]

A survey on dialogue management in human-robot interaction,

M. M. Reimann, F. A. Kunneman, C. Oertel et al. , “A survey on dialogue management in human-robot interaction,” ACM Transactions on Human-Robot Interaction , 2024

work page 2024

[6] [6]

Embodied agent interface: Bench- marking llms for embodied decision making,

M. Li, S. Zhao, Q. Wang et al. , “Embodied agent interface: Bench- marking llms for embodied decision making,” Advances in Neural Information Processing Systems , vol. 37, pp. 100 428–100 534, 2025

work page 2025

[7] [7]

CoPAL: corrective planning of robot actions with large language models,

F. Joublin, A. Ceravola, P. Smirnov et al., “CoPAL: corrective planning of robot actions with large language models,” in 2024 IEEE Interna- tional Conference on Robotics and Automation (ICRA) . IEEE, 2024, pp. 8664–8670

work page 2024

[8] [8]

Robots that use lan- guage,

S. Tellex, N. Gopalan, H. Kress-Gazit et al. , “Robots that use lan- guage,”Annual Review of Control, Robotics, and Autonomous Systems, vol. 3, no. 1, pp. 25–55, 2020

work page 2020

[9] [9]

Vision-language model-driven scene understanding and robotic object manipulation,

S. Liu, J. Zhang, R. X. Gao et al. , “Vision-language model-driven scene understanding and robotic object manipulation,” in 2024 IEEE 20th International Conference on Automation Science and Engineering (CASE). IEEE, 2024, pp. 21–26

work page 2024

[10] [10]

Speakers’ eye gaze disambiguates referring expressions early during face-to-face conversation,

J. E. Hanna and S. E. Brennan, “Speakers’ eye gaze disambiguates referring expressions early during face-to-face conversation,” Journal of Memory and Language , vol. 57, no. 4, pp. 596–615, 2007

work page 2007

[11] [11]

The utility of gaze in spoken human- robot interaction,

M. Staudte and M. Crocker, “The utility of gaze in spoken human- robot interaction,” in Proceedings of Workshop on Metrics for Human- Robot Interaction 2008 , 2008, pp. 53–59

work page 2008

[12] [12]

A constructive model for the development of joint attention,

Y . Nagai, K. Hosoda, A. Morita et al., “A constructive model for the development of joint attention,” Connection Science , vol. 15, no. 4, pp. 211–229, 2003

work page 2003

[13] [13]

Social eye gaze in human-robot interaction: a review,

H. Admoni and B. Scassellati, “Social eye gaze in human-robot interaction: a review,” Journal of Human-Robot Interaction , vol. 6, no. 1, pp. 25–63, 2017

work page 2017

[14] [14]

Gaze-based intention estimation: principles, method- ologies, and applications in hri,

A. Belardinelli, “Gaze-based intention estimation: principles, method- ologies, and applications in hri,” ACM Transactions on Human-Robot Interaction, vol. 13, no. 3, pp. 1–30, 2024

work page 2024

[15] [15]

Integrating egocentric and robotic vision for object identification using siamese networks and superquadric estimations in partial occlusion scenarios,

E. Menendez, S. Mart ´ınez, F. D ´ıaz-de Mar´ıa et al. , “Integrating egocentric and robotic vision for object identification using siamese networks and superquadric estimations in partial occlusion scenarios,” Biomimetics, vol. 9, no. 2, p. 100, 2024

work page 2024

[16] [16]

Situated open world ref- erence resolution for human-robot dialogue,

T. Williams, S. Acharya, S. Schreitter et al., “Situated open world ref- erence resolution for human-robot dialogue,” in 2016 11th ACM/IEEE International Conference on Human-Robot Interaction (HRI) . IEEE, 2016, pp. 311–318

work page 2016

[17] [17]

To help or not to help: Llm- based attentive support for human-robot group interactions,

D. Tanneberg, F. Ocker, S. Hasler et al., “To help or not to help: Llm- based attentive support for human-robot group interactions,” in 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2024, pp. 9130–9137

work page 2024

[18] [18]

Lami: Large language mod- els for multi-modal human-robot interaction,

C. Wang, S. Hasler, D. Tanneberg et al., “Lami: Large language mod- els for multi-modal human-robot interaction,” in Extended Abstracts of the CHI Conference on Human Factors in Computing Systems , 2024, pp. 1–10

work page 2024

[19] [19]

Situated dialogue pro- cessing for human-robot interaction,

G.-J. M. Kruijff, P. Lison, T. Benjamin et al., “Situated dialogue pro- cessing for human-robot interaction,” in Cognitive systems. Springer, 2010, pp. 311–364

work page 2010

[20] [20]

Going beyond literal command-based instructions: Extending robotic natural language in- teraction capabilities,

T. Williams, G. Briggs, B. Oosterveld et al. , “Going beyond literal command-based instructions: Extending robotic natural language in- teraction capabilities,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 29, no. 1, 2015

work page 2015

[21] [21]

Collaborative effort towards common ground in situated human-robot dialogue,

J. Y . Chai, L. She, R. Fang et al. , “Collaborative effort towards common ground in situated human-robot dialogue,” in 2014 9th ACM/IEEE International Conference on Human-Robot Interaction (HRI), 2014, pp. 33–40

work page 2014

[22] [22]

Semantically-driven disambigua- tion for human-robot interaction,

F. I. Dogan, W. Liu, I. Leite et al., “Semantically-driven disambigua- tion for human-robot interaction,” arXiv preprint arXiv:2409.17004 , 2024

work page arXiv 2024

[23] [23]

HandMe That: Human-robot communication in physical and social environments,

Y . Wan, J. Mao, and J. Tenenbaum, “HandMe That: Human-robot communication in physical and social environments,” Advances in Neural Information Processing Systems , vol. 35, pp. 12 014–12 026, 2022

work page 2022

[24] [24]

The reliability of non-verbal cues for situated reference resolution and their interplay with language: implications for human robot interaction,

S. Gross, B. Krenn, and M. Scheutz, “The reliability of non-verbal cues for situated reference resolution and their interplay with language: implications for human robot interaction,” in Proceedings of the 19th ACM international conference on multimodal interaction , 2017, pp. 189–196

work page 2017

[25] [25]

Language, common sense, and the Wino- grad schema challenge,

J. Browning and Y . LeCun, “Language, common sense, and the Wino- grad schema challenge,” Artificial Intelligence, vol. 325, p. 104031, 2023

work page 2023

[26] [26]

V oila-a: Aligning vision-language models with user’s gaze attention,

K. Yan, Z. Wang, L. Ji et al. , “V oila-a: Aligning vision-language models with user’s gaze attention,” Advances in Neural Information Processing Systems, vol. 37, pp. 1890–1918, 2025

work page 1918

[27] [27]

GazePointAR: A context-aware multimodal voice assistant for pronoun disambiguation in wearable augmented reality,

J. Lee, J. Wang, E. Brown et al. , “GazePointAR: A context-aware multimodal voice assistant for pronoun disambiguation in wearable augmented reality,” in Proceedings of the CHI Conference on Human Factors in Computing Systems , 2024, pp. 1–20

work page 2024

[28] [28]

GazeGPT: Augmenting human capabilities using gaze-contingent contextual ai for smart eyewear,

R. Konrad, N. Padmanaban, J. G. Buckmaster et al. , “GazeGPT: Augmenting human capabilities using gaze-contingent contextual ai for smart eyewear,” arXiv preprint arXiv:2401.17217 , 2024

work page arXiv 2024

[29] [29]

Specifying target objects in robot teleoperation using speech and natural eye gaze,

Y .-C. Chang, N. Gandi, K. Shin et al. , “Specifying target objects in robot teleoperation using speech and natural eye gaze,” in 2023 IEEE-RAS 22nd International Conference on Humanoid Robots (Hu- manoids). IEEE, 2023, pp. 1–7

work page 2023

[30] [30]

Understanding large-language model (llm)-powered human-robot interaction,

C. Y . Kim, C. P. Lee, and B. Mutlu, “Understanding large-language model (llm)-powered human-robot interaction,” in Proceedings of the 2024 ACM/IEEE International Conference on Human-Robot Interac- tion, 2024, pp. 371–380

work page 2024

[31] [31]

Speaking and listening with the eyes: Gaze signaling during dyadic interactions,

S. Ho, T. Foulsham, and A. Kingstone, “Speaking and listening with the eyes: Gaze signaling during dyadic interactions,” PloS one, vol. 10, no. 8, p. e0136905, 2015

work page 2015

[32] [32]

Looking coordinated: Bidi- rectional gaze mechanisms for collaborative interaction with virtual characters,

S. Andrist, M. Gleicher, and B. Mutlu, “Looking coordinated: Bidi- rectional gaze mechanisms for collaborative interaction with virtual characters,” in Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems , ser. CHI ’17. New York, NY , USA: Association for Computing Machinery, 2017, p. 2571–2582

work page 2017

[33] [33]

Head pose as a proxy for gaze in virtual reality,

P. Higgins, R. Barron, and C. Matuszek, “Head pose as a proxy for gaze in virtual reality,” in 5th international workshop on virtual, augmented, and mixed reality for HRI , 2022

work page 2022

[34] [34]

Object-aware gaze target detection,

F. Tonini, N. Dall’Asen, C. Beyan et al. , “Object-aware gaze target detection,” in Proceedings of the IEEE/CVF International Conference on Computer Vision , 2023, pp. 21 860–21 869

work page 2023

[35] [35]

A pipeline for estimating human attention toward objects with on-board cameras on the icub humanoid robot,

S. Hanifi, E. Maiettini, M. Lombardi et al., “A pipeline for estimating human attention toward objects with on-board cameras on the icub humanoid robot,” Frontiers in Robotics and AI , vol. 11, p. 1346714, 2024

work page 2024

[36] [36]

A review of machine learning in scanpath analysis for passive gaze-based interaction,

A. Mohamed Selim, M. Barz, O. S. Bhatti et al., “A review of machine learning in scanpath analysis for passive gaze-based interaction,” Frontiers in Artificial Intelligence, vol. 7, p. 1391745, 2024

work page 2024

[37] [37]

Can you pass that tool?: Implications of indirect speech in physical human-robot collaboration,

Y . Zhang, T. S. Ratnayake, C. Sew et al. , “Can you pass that tool?: Implications of indirect speech in physical human-robot collaboration,” arXiv preprint arXiv:2502.11720 , 2025. APPENDIX See Listing 1 for the system prompt and Table III for the available tools. Listing 1: System prompt. You are a friendly and attentive service agent. You control a phy...

work page arXiv 2025

[38] [38]

Always start gathering all available information related to the request from the scene and the input

work page

[39] [39]

Use gaze to clarify speech, when requests are ambiguous

Always focus on understanding the user’s intent based on context, speech input, and gaze history. Use gaze to clarify speech, when requests are ambiguous. Use speech to clarify gaze, when requests are ambiguous

work page

[40] [40]

Be concise and clear

Provide a reason for every response to user requests using the ’reasoning’ function to explain decisions. Be concise and clear

work page

[41] [41]

Speak out loud using the ’speak’ function to communicate clearly and concisely with the user

work page

[42] [42]

If you are not sure about the user’s intent, ask for clarification

work page

[43] [43]

REMEMBER YOUR RULES!! TIPS FOR INTERPRETING GAZE:

Provide the ’required_objects’ for every user request. REMEMBER YOUR RULES!! TIPS FOR INTERPRETING GAZE:

work page

[44] [44]

Referred objects are usually gazed ahead of utterance, but also right before looking at you

work page

[45] [45]

Intentionally referred objects are usually looked at longer and more frequently

work page

[46] [46]

TABLE III: Overview of Available Tools and Their Arguments Tool Description Arguments Query Tools query objects Query all objects that are available in the scene

Spurious fixations are usually short and mixed with closer objects. TABLE III: Overview of Available Tools and Their Arguments Tool Description Arguments Query Tools query objects Query all objects that are available in the scene. You can see all these objects. - Diagnostic Tools reasoning You provide a reason for the action you are about to take. - requi...

work page