GIVE: Grounding Human Gestures in Vision-Language-Action Models

Boyu Ma; Gen Li; Jianfei Yang; Jindou Jia; Junqiao Fan; Pengfei Liu; Yang Xiao

arxiv: 2606.13435 · v1 · pith:B4JI6K3Bnew · submitted 2026-06-11 · 💻 cs.RO

Pengfei Liu, Gen Li, Junqiao Fan, Boyu Ma, Jindou Jia, Yang Xiao, Jianfei Yang

Pith reviewed 2026-06-27 06:33 UTC · model grok-4.3

classification 💻 cs.RO

keywords human gestures · vision-language-action models · human-robot interaction · intent grounding · robotic manipulation · multimodal input

The pith

Overlaying hand skeletons and generating gesture descriptions lets pre-trained VLA models interpret human intent more accurately during robot tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Current VLA models for robotic manipulation treat tasks as purely text-driven and miss the gestures people use to clarify what they want. The paper shows that two simple additions, drawing hand skeletons and fingertip rays on camera images and writing short descriptions of the gestures, let the same models ground those intentions without architecture changes. In real-world tests this improved accuracy at identifying the right object by 40 percent and overall task success by 80 percent over the baseline. A sympathetic reader would care because it addresses a common failure point where spoken instructions alone are vague or incomplete.

Core claim

GIVE adds gesture information to existing VLA models through a visual pathway that overlays hand skeletons and fingertip rays on robot observations for explicit object grounding and a semantic pathway that produces high-level descriptions of gestures and task instructions for intent grounding; the two pathways together let policies associate gestures with manipulation behaviors and adapt to dynamic intents, yielding 40 percent better target recognition accuracy and 80 percent better task success in real HRI experiments while remaining robust to unseen layouts and different participants.
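The visual pathway described above can be pictured with a minimal sketch. It is not the authors' implementation: the abstract does not give the keypoint format, so the 21-keypoint layout (of the kind common in hand-pose estimators), the choice of the index finger for the ray, the colors, and the names `HAND_EDGES`, `draw_line`, and `overlay_gesture` are all assumptions made for illustration.

```python
import numpy as np

# Assumed 21-keypoint hand layout (wrist = 0, four joints per finger).
# The paper's actual keypoint format is not stated in the abstract.
HAND_EDGES = [
    (0, 1), (1, 2), (2, 3), (3, 4),          # thumb
    (0, 5), (5, 6), (6, 7), (7, 8),          # index finger
    (0, 9), (9, 10), (10, 11), (11, 12),     # middle finger
    (0, 13), (13, 14), (14, 15), (15, 16),   # ring finger
    (0, 17), (17, 18), (18, 19), (19, 20),   # little finger
]
INDEX_BASE, INDEX_TIP = 5, 8  # assumed indices of the index-finger base and tip


def draw_line(image, p0, p1, color):
    """Rasterize a segment by dense sampling; pixels outside the image are skipped."""
    height, width = image.shape[:2]
    steps = int(np.ceil(max(abs(p1[0] - p0[0]), abs(p1[1] - p0[1])))) + 1
    xs = np.rint(np.linspace(p0[0], p1[0], steps)).astype(int)
    ys = np.rint(np.linspace(p0[1], p1[1], steps)).astype(int)
    inside = (xs >= 0) & (xs < width) & (ys >= 0) & (ys < height)
    image[ys[inside], xs[inside]] = color


def overlay_gesture(image, keypoints, ray_length=400.0):
    """Return a copy of `image` (H x W x 3) with skeleton and fingertip ray drawn.

    `keypoints` is a (21, 2) float array of (x, y) pixel coordinates.
    """
    out = image.copy()
    for a, b in HAND_EDGES:
        draw_line(out, keypoints[a], keypoints[b], (0, 255, 0))
    base, tip = keypoints[INDEX_BASE], keypoints[INDEX_TIP]
    direction = (tip - base) / (np.linalg.norm(tip - base) + 1e-9)
    # Extend the finger direction beyond the fingertip; clipping happens in draw_line.
    draw_line(out, tip, tip + ray_length * direction, (255, 0, 0))
    return out
```

Because the result is an ordinary image of the same shape, it can be passed to a policy in place of the raw camera frame, which is the sense in which no model change is needed.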

What carries the argument

GIVE's dual visual-semantic pathways that inject gesture data into pre-trained VLA inputs without model changes.

If this is right

VLA policies can link specific gestures to concrete manipulation actions even when language is ambiguous.
The same pre-trained model can adapt to changing interaction goals without retraining.
Performance holds across spatial layouts and participant groups not seen during development.
No architectural redesign is required to obtain the reported gains; the abstract does not say whether additional training is involved.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same overlay-and-description method could be tried with other non-verbal signals such as gaze or posture.
Natural language commands for robots might become shorter and less precise if gesture channels are routinely available.
Collecting paired gesture-language data during deployment could allow later fine-tuning that further reduces reliance on explicit instructions.

Load-bearing premise

That simple visual overlays and generated text descriptions of gestures are enough to produce reliable intent grounding in new layouts and with new users.
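A small geometric sketch shows why this premise is load-bearing. The paper leaves grounding to the VLA model itself, so the explicit nearest-to-ray rule below is a stand-in used only for illustration; the function name `rank_objects_by_ray`, the 2D image-plane setting, and the object coordinates in the usage note are hypothetical.

```python
import numpy as np


def rank_objects_by_ray(tip, direction, objects):
    """Sort (name, distance) pairs by perpendicular distance to a 2D ray.

    Objects behind the fingertip are measured to the fingertip itself.
    """
    tip = np.asarray(tip, dtype=float)
    d = np.asarray(direction, dtype=float)
    d = d / np.linalg.norm(d)
    scored = []
    for name, center in objects.items():
        offset = np.asarray(center, dtype=float) - tip
        along = max(float(offset @ d), 0.0)   # clamp: a ray has no backward half
        dist = float(np.linalg.norm(offset - along * d))
        scored.append((name, dist))
    return sorted(scored, key=lambda item: item[1])
```

With a fingertip at (100, 100) pointing along (1, 0), a cup centered at (200, 100) lies on the ray and one at (200, 140) is 40 pixels away, so the choice is clear. Tilting the ray to (1, 0.2), about 11 degrees, makes the two cups equidistant. An error of that size in the estimated finger direction would be enough to remove the disambiguating signal, which is why results on unseen layouts and participants matter.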

What would settle it

A new set of real-world HRI trials with unseen layouts and participants in which the target recognition or task success rates show no meaningful gain over the baseline VLA model.
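Whether a gain counts as meaningful depends on trial counts, which the abstract does not report. As a sketch of how such a comparison could be checked, the function below computes a one-sided Fisher's exact test from success counts using only the standard library. The name `fisher_one_sided` and all counts in the usage note are hypothetical, and the paper may use a different test.

```python
from math import comb


def fisher_one_sided(success_a, n_a, success_b, n_b):
    """P-value for 'method A succeeds more often than B' (Fisher's exact test).

    Sums hypergeometric probabilities of all tables at least as favorable
    to A as the observed one, with row and column totals held fixed.
    """
    total_success = success_a + success_b
    total = n_a + n_b
    denom = comb(total, total_success)
    upper = min(n_a, total_success)
    p = 0.0
    for k in range(success_a, upper + 1):
        # comb returns 0 when the second argument exceeds the first,
        # so infeasible tables contribute nothing.
        p += comb(n_a, k) * comb(n_b, total_success - k) / denom
    return p
```

For hypothetical counts of 9 successes in 10 trials against 1 in 10, the p-value is 101/184756, well below conventional thresholds. With only 3 trials per condition, even 3 of 3 against 0 of 3 gives exactly 0.05, so small studies cannot settle the question either way.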

Original abstract

Human communication is inherently multimodal, where language is often accompanied by non-verbal cues such as gestures to convey intentions. However, current Vision-Language-Action (VLA) models treat robotic manipulation as a pure text-driven task, overlooking the important role of gestures in Human-Robot Interaction (HRI). This often leads to inaccurate intent grounding and unreliable manipulation when language instructions are ambiguous or underspecified. To address this challenge, we propose GIVE (Gesture Intent via Visual-Semantic Enhancement), an effective approach that enhances pre-trained VLA models with human gesture understanding without architectural modifications. Specifically, GIVE incorporates gesture information through two complementary pathways: a visual pathway that overlays hand skeletons and fingertip rays onto robot observations for explicit object grounding, and a semantic pathway that generates high-level descriptions of human gestures and task instructions for robust intent grounding. By jointly leveraging visual and semantic guidance, GIVE enables VLA policies to better associate gestures with manipulation behaviors and adapt to dynamic interaction intents. In real-world HRI experiments, GIVE substantially outperforms the baseline, improving target object recognition accuracy by 40% and overall task success rate by 80%, while demonstrating strong robustness and generalization to unseen spatial layouts and diverse participants.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

GIVE adds hand skeleton overlays and generated gesture descriptions to pre-trained VLA models without architecture changes, claiming 40% and 80% gains in HRI tasks, but the abstract supplies no experimental protocol or stats to evaluate those numbers.

The core idea is to feed gesture information into existing VLA models through two routes: a visual one that draws hand skeletons and fingertip rays on the robot's camera view, and a semantic one that produces text summaries of the gestures plus the task. This is meant to improve intent grounding when spoken instructions are incomplete.
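The semantic route can be sketched in a similar spirit. The abstract says only that high-level descriptions are generated, not how, so the rule-based template below is a deliberately simple substitute for whatever generator the authors use. The function names, the eight-sector direction vocabulary, and the prompt wording are assumptions.

```python
import math


def describe_pointing(base_xy, tip_xy):
    """Map an index-finger direction in image coordinates to a coarse phrase."""
    dx = tip_xy[0] - base_xy[0]
    dy = tip_xy[1] - base_xy[1]            # image y grows downward
    angle = math.degrees(math.atan2(-dy, dx)) % 360.0
    sectors = ["right", "upper right", "top", "upper left",
               "left", "lower left", "bottom", "lower right"]
    # Each sector spans 45 degrees, centered on its compass direction.
    return sectors[int(((angle + 22.5) % 360.0) // 45.0)]


def build_prompt(instruction, base_xy, tip_xy):
    """Combine the spoken instruction with a generated gesture description."""
    where = describe_pointing(base_xy, tip_xy)
    gesture = f"The person is pointing toward the {where} of the image."
    return f"{gesture} Task: {instruction}"
```

The output is plain text, so it can replace the original instruction string in an existing VLA input without touching the tokenizer or the model.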

The paper does a reasonable job identifying a real limitation in current VLA work, which tends to ignore non-verbal signals that humans use all the time in collaboration. Keeping the fix lightweight by avoiding model changes is a practical choice that could let others test it quickly on their own setups.

The main weakness is the complete absence of experimental detail. The abstract states large improvements in object recognition and task success but says nothing about trial counts, baseline implementations, variance, statistical tests, or exactly how the overlays and descriptions are tokenized and inserted. Without that information the performance claims cannot be assessed, and the assertions about robustness to new layouts and participants rest on the same thin ground. The stress-test note correctly flags that no deeper technical checks are possible from the abstract alone.
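One concrete consequence of the missing protocol is that "improving task success rate by 80%" can be read as a relative gain or as 80 percentage points, and the abstract does not say which. The arithmetic below compares the two readings for a few hypothetical baseline rates; none of these baselines come from the paper.

```python
def after_gain(baseline, gain, relative):
    """Success rate after a gain, read as relative (x1.8) or absolute (+0.8)."""
    return baseline * (1.0 + gain) if relative else baseline + gain


# Hypothetical baseline success rates, chosen only to contrast the readings.
for baseline in (0.10, 0.20, 0.50):
    rel = after_gain(baseline, 0.80, relative=True)
    abs_ = after_gain(baseline, 0.80, relative=False)
    feasible = "ok" if abs_ <= 1.0 else "impossible (>100%)"
    print(f"baseline {baseline:.0%}: relative -> {rel:.0%}, "
          f"absolute -> {abs_:.0%} [{feasible}]")
```

A baseline of 50% rules out the absolute reading, since it would exceed 100%, and gives 90% under the relative one. A baseline of 10% gives 18% under the relative reading and 90% under the absolute one. The headline figure therefore cannot be interpreted without the baseline rate.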

This is aimed at people building or extending VLA systems for real-world HRI. A reader who wants concrete implementation ideas for adding gesture cues might extract something useful, but anyone needing verified results will have to wait for the full paper.

It deserves peer review so the experimental section can be examined properly; the idea is straightforward enough that a referee could quickly determine whether the gains are reproducible.

Referee Report

1 major / 0 minor

Summary. The paper presents GIVE (Gesture Intent via Visual-Semantic Enhancement), a method to improve pre-trained Vision-Language-Action (VLA) models by incorporating human gestures for better intent grounding in robotic manipulation tasks. It uses a visual pathway with overlays of hand skeletons and fingertip rays on robot observations, and a semantic pathway with generated high-level descriptions of gestures and instructions. The approach requires no architectural changes to the VLA model. Real-world experiments show 40% improvement in target object recognition accuracy and 80% in task success rate compared to baseline, with robustness to unseen layouts and diverse participants.

Significance. If the reported improvements hold under rigorous evaluation, this work could meaningfully advance the field of human-robot interaction by enabling VLA models to leverage non-verbal cues like gestures, which are common in human communication but currently ignored in text-driven VLA approaches. It provides a simple, architecture-agnostic way to integrate multimodal inputs.

major comments (1)

[Abstract] The central empirical claims of 40% improvement in target object recognition accuracy and 80% in overall task success rate are stated without any details on the experimental protocol, number of trials, baseline implementation, variance across runs, number of participants, or statistical tests. This information is essential to assess whether the data supports the claims of substantial outperformance and generalization.
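To make the point about trial counts concrete, the sketch below computes an approximate 95% Wilson score interval for a success rate. The function name `wilson_interval` and the example counts are illustrative assumptions; the paper's actual trial numbers are not given in the abstract.

```python
from math import sqrt


def wilson_interval(successes, trials, z=1.96):
    """Approximate 95% Wilson score interval for a binomial success rate."""
    p_hat = successes / trials
    denom = 1.0 + z * z / trials
    center = (p_hat + z * z / (2 * trials)) / denom
    half = z * sqrt(p_hat * (1 - p_hat) / trials
                    + z * z / (4 * trials * trials)) / denom
    return center - half, center + half
```

For a hypothetical 9 successes in 10 trials the interval runs from roughly 0.60 to 0.98, while 90 in 100 gives a much narrower range around the same point estimate. Reporting the counts would let a reader judge how much of a claimed gain could be sampling noise.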

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for highlighting the need for greater transparency in the abstract regarding our experimental claims. We agree that including key details on the protocol will strengthen the presentation and will revise the abstract accordingly while ensuring the full experimental details remain in the methods and results sections.

Referee: [Abstract] The central empirical claims of 40% improvement in target object recognition accuracy and 80% in overall task success rate are stated without any details on the experimental protocol, number of trials, baseline implementation, variance across runs, number of participants, or statistical tests. This information is essential to assess whether the data supports the claims of substantial outperformance and generalization.

Authors: We acknowledge that the current abstract is concise and omits these specifics. In the revised manuscript, we will expand the abstract to include a brief summary of the experimental protocol (real-world HRI setup with multiple participants and trials), the baseline VLA implementation, number of trials and participants, and mention of statistical tests confirming the reported improvements. Full details, including variance and per-participant results, are already provided in Section 4; the abstract revision will make these claims more self-contained without altering the reported numbers.

Revision: yes

Circularity Check

0 steps flagged

Empirical method with no derivations or self-referential predictions

The paper describes a practical method for enhancing pre-trained VLA models by adding gesture information through visual overlays (hand skeletons and rays) and generated semantic descriptions, followed by real-world HRI experiments reporting accuracy and success rate improvements. No equations, parameter fitting, derivation chains, or self-citations are present in the provided text that could reduce any claim to its own inputs by construction. The central claims rest on experimental outcomes rather than any logical or definitional loop.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an applied method paper that augments existing pre-trained models. No new free parameters, axioms, or invented entities are introduced; the VLA base models and their parameters are taken from prior literature.
