pith. machine review for the scientific record.

arxiv: 2604.05320 · v3 · submitted 2026-04-07 · 💻 cs.RO

Recognition: no theorem link

ExpressMM: Expressive Mobile Manipulation Behaviors in Human-Robot Interactions

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 19:50 UTC · model grok-4.3

classification 💻 cs.RO
keywords: expressive robot behaviors · human-robot interaction · mobile manipulation · vision-language models · collaborative assembly · interruptible interactions

The pith

ExpressMM combines a vision-language planner with a low-level action policy to let mobile manipulators produce expressive behaviors that communicate intent during collaborative tasks with humans, including when users interrupt or redirect.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a framework for mobile robots to perform tasks while also signaling their intentions through movement and posture in shared spaces with people. It pairs a high-level planner that uses vision-language models to interpret scenes and conversations with a low-level policy that turns those interpretations into concrete robot motions. This setup explicitly handles cases where humans change instructions partway through a task. Questionnaire results from live demonstrations indicate that observers found the robot's actions easy to read and the overall interaction socially fitting and predictable. If the approach holds, it would mean robots could operate more fluidly alongside people without requiring constant reprogramming for every variation in human input.

Core claim

The central claim is that the ExpressMM framework integrates a high-level language-guided planner based on a vision-language model for perception and conversational reasoning with a low-level vision-language-action policy to generate expressive robot behaviors during collaborative HRI tasks. The framework further supports interruptible interactions to accommodate updated or redirecting instructions by users. In a demonstrated mobile manipulator assembly scenario, audience evaluations showed that these behaviors helped observers clearly interpret the robot's actions and intentions, supported socially appropriate and understandable interactions, and led participants to view the robot as useful for collaborative tasks and as behaving in a predictable and safe manner.

What carries the argument

The ExpressMM framework, which links a high-level vision-language model planner for scene understanding and reasoning to a low-level vision-language-action policy that produces task-aligned expressive motions.
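
To make that division of labor concrete, here is a minimal sketch of how a high-level VLM planner might hand an action sequence to a low-level VLA policy while checking for user interruptions between steps. This is a hypothetical reconstruction in Python, not the paper's code: every class, method, and parameter name below (VLMPlanner, VLAPolicy, run_task, and so on) is invented for illustration.

    # Hypothetical sketch of the planner-to-policy handoff; names are illustrative
    # and not taken from the ExpressMM implementation.
    from dataclasses import dataclass, field

    @dataclass
    class Action:
        name: str                                     # e.g. "handover" or "return_to_home"
        params: dict = field(default_factory=dict)    # target object, pose, expressive style

    class VLMPlanner:
        """High-level planner: maps the current scene and dialogue to an action sequence."""
        def plan(self, scene_image, dialogue) -> list[Action]:
            # A real system would prompt a vision-language model here and parse
            # its reply into structured actions.
            raise NotImplementedError

    class VLAPolicy:
        """Low-level policy: advances one action by a single control step."""
        def execute_step(self, action: Action, observation) -> bool:
            # Returns True once the action has finished executing.
            raise NotImplementedError

    def run_task(planner: VLMPlanner, policy: VLAPolicy, robot, get_utterance):
        """Execute a plan, replanning whenever the user speaks up mid-task."""
        plan = planner.plan(robot.observe(), dialogue=[])
        while plan:
            action = plan.pop(0)
            done = False
            while not done:
                utterance = get_utterance()           # non-blocking; None while the user is silent
                if utterance is not None:             # user interrupts or redirects
                    plan = planner.plan(robot.observe(), dialogue=[utterance])
                    break                             # abandon the old plan and start on the new one
                done = policy.execute_step(action, robot.observe())

On this reading, the load-bearing design choice is that interruption handling lives at the planner boundary: the policy only ever advances one control step at a time, so the planner is free to replace the remaining plan between steps.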

If this is right

  • Expressive behaviors generated by the framework allow observers to clearly interpret the robot's actions and intentions.
  • The approach supports socially appropriate and understandable interactions in collaborative human-robot settings.
  • Participants perceive the robot as useful for collaborative tasks and as behaving in a predictable and safe manner.
  • Interruptible interactions enable the robot to accommodate updated or redirecting instructions from users during task execution.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same planner-plus-policy structure could reduce reliance on pre-programmed motions when deploying robots in varied human environments.
  • Clear intent communication during interruptions might lower the cognitive load on human collaborators over repeated interactions.

Load-bearing premise

The vision-language model planner combined with the action policy will reliably generate expressive behaviors that match task goals and adapt to human interruptions without producing unsafe motions or misinterpretations in real settings.

What would settle it

A live test in which a human redirects the robot mid-task; the claim would be undermined if the system produces motions that observers cannot interpret or generates unsafe movements that contradict the reported safety and predictability ratings.

Figures

Figures reproduced from arXiv: 2604.05320 by Goldie Nejat, Haitong Wang, Matthew Lisondra, Souren Pashangpour.

Figure 1
Figure 1. Example HRI scenario during a collaborative assembly task using the proposed ExpressMM framework. The robot holds and returns a screwdriver through language-guided planning and VLA execution.
Figure 2
Figure 2. The proposed expressive mobile manipulation architecture consisting of: 1) a Perception Module (PM), 2) a VLM Interaction Planner (IP), and 3) an Expressive Robot Controller (ERC).
Figure 3
Figure 3. Setup of the Turtlebot-4 with SO-101 arm mobile manipulator robot interacting with a researcher in front of a classroom audience.
Figure 6
Figure 6. Box-and-whisker plots for the Perceived Usefulness ratings.
read the original abstract

Mobile manipulators are increasingly deployed in human-centered environments to perform tasks. While completing such tasks, they should also be able to communicate their intent to the people around them using expressive robot behaviors. Prior work on expressive robot behaviors has used preprogrammed or learning-from-demonstration-based expressive motions and large language model generated high-level interactions. The majority of these existing approaches have not considered human-robot interactions (HRI) where users may interrupt, modify, or redirect a robot's actions during task execution. In this paper, we develop the novel ExpressMM framework that integrates a high-level language-guided planner based on a vision-language model for perception and conversational reasoning with a low-level vision-language-action policy to generate expressive robot behaviors during collaborative HRI tasks. Furthermore, ExpressMM supports interruptible interactions to accommodate updated or redirecting instructions by users. We demonstrate ExpressMM on a mobile manipulator assisting a human in a collaborative assembly scenario and conduct audience-based evaluation of live HRI demonstrations. Questionnaire results show that the ExpressMM-enabled expressive behaviors helped observers clearly interpret the robot's actions and intentions while supporting socially appropriate and understandable interactions. Participants also reported that the robot was useful for collaborative tasks and behaved in a predictable and safe manner during the demonstrations, fostering positive perceptions of the robot's usefulness, safety, and predictability during the collaborative tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces the ExpressMM framework for mobile manipulators in collaborative human-robot interaction (HRI). It combines a high-level language-guided planner based on a vision-language model (VLM) for perception and conversational reasoning with a low-level vision-language-action (VLA) policy to produce expressive robot behaviors. The framework is designed to support interruptible interactions where users can redirect the robot mid-task. The approach is demonstrated in a collaborative assembly scenario on a mobile manipulator, with evaluation consisting of live demonstrations followed by audience-based questionnaires assessing intent interpretation, social appropriateness, usefulness, safety, and predictability.

Significance. If the VLM-VLA integration reliably generates expressive, goal-aligned behaviors that adapt to interruptions without perception errors or unsafe motions, the work would contribute to HRI by moving beyond preprogrammed or LfD-based expressiveness toward language-driven, interruptible mobile manipulation. The live-demonstration setup and positive questionnaire feedback provide initial evidence of improved observer understanding. However, the absence of quantitative metrics, baselines, or active-collaborator testing limits the strength of claims about real-time adaptation and safety.

major comments (2)
  1. [Experiments] Experiments section: The central claim that ExpressMM supports interruptible interactions rests on audience questionnaires from passive observation of demonstrations. No data or protocol is provided for active users issuing redirecting instructions mid-task, perception errors during interruptions, or physical safety in collaborative settings, leaving the weakest assumption (reliable VLM+VLA integration without errors or unsafe motions) untested.
  2. [Abstract and Experiments] Abstract and Experiments: No quantitative metrics (e.g., success rates, latency, error rates), baseline comparisons (e.g., against non-expressive or non-interruptible policies), or ablation studies on the VLM planner vs. VLA policy are reported. This makes it impossible to verify the integration's performance or isolate the contribution of expressiveness.
minor comments (2)
  1. [Abstract] The abstract states that 'questionnaire results show...' but provides no sample size, question wording, or statistical analysis; adding these details would improve reproducibility.
  2. [Framework] Framework description would benefit from a diagram or pseudocode showing the exact data flow between the high-level VLM planner and low-level VLA policy, including how interruptions are parsed and executed.
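
For what it is worth, the following is a minimal sketch of the kind of data flow this comment asks for, pieced together from the module names in Figure 2 (Perception Module, VLM Interaction Planner, Expressive Robot Controller). It is a guess, not the authors' code: the keyword matching stands in for the VLM's conversational reasoning, and all identifiers are hypothetical.

    # Hypothetical data flow across the three modules named in Figure 2:
    # Perception Module (PM) -> VLM Interaction Planner (IP) -> Expressive Robot Controller (ERC).
    # Interfaces are invented for illustration; a real IP would query a VLM rather than
    # match keywords.

    def parse_interruption(utterance: str) -> dict:
        """IP step: classify an incoming utterance as a redirect or a continuation."""
        redirect_cues = ("instead", "actually", "no,", "wait", "back")
        if any(cue in utterance.lower() for cue in redirect_cues):
            return {"type": "redirect", "instruction": utterance}
        return {"type": "continue", "instruction": None}

    def route(event: dict, plan: list[str]) -> list[str]:
        """Decide what the ERC executes next: the current plan or a replanned one."""
        if event["type"] == "redirect":
            # A real IP would regenerate the whole action sequence with the VLM here.
            return ["stop_current_motion()", f"replan({event['instruction']!r})"]
        return plan

    # Example: the user interrupts mid-task, as in the Figure 1 scenario.
    plan = ["pick(screwdriver)", "handover(user)", "return_to_home()"]
    event = parse_interruption("Okay, please hand me the screwdriver back")
    print(route(event, plan))    # cue "back" fires -> stop the current motion and replan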

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and will revise the paper accordingly to clarify the scope of our evaluation and strengthen the presentation of limitations.

read point-by-point responses
  1. Referee: [Experiments] Experiments section: The central claim that ExpressMM supports interruptible interactions rests on audience questionnaires from passive observation of demonstrations. No data or protocol is provided for active users issuing redirecting instructions mid-task, perception errors during interruptions, or physical safety in collaborative settings, leaving the weakest assumption (reliable VLM+VLA integration without errors or unsafe motions) untested.

    Authors: We agree that the reported evaluation relies on audience questionnaires collected after live demonstrations rather than controlled experiments with active collaborators. The demonstrations illustrated interruptible behaviors, with the VLM planner handling redirecting instructions and the VLA policy executing the updated actions in real time. However, we did not include formal protocols, success rates for interruptions, or measurements of perception errors and physical safety. We will revise the Experiments section to describe the demonstration protocol in greater detail, explicitly state that the evaluation is perceptual and passive, and add a limitations paragraph outlining the need for future active-user studies with quantitative safety and error analysis. revision: yes

  2. Referee: [Abstract and Experiments] Abstract and Experiments: No quantitative metrics (e.g., success rates, latency, error rates), baseline comparisons (e.g., against non-expressive or non-interruptible policies), or ablation studies on the VLM planner vs. VLA policy are reported. This makes it impossible to verify the integration's performance or isolate the contribution of expressiveness.

    Authors: The manuscript prioritizes a framework demonstration and human-perception evaluation of expressiveness in HRI over performance benchmarking. We acknowledge that the absence of quantitative metrics, baselines, and ablations limits the ability to isolate component contributions or compare against alternatives. No systematic success rates or latency data were collected during the live demonstrations. We will revise the Experiments and Discussion sections to report any available observational data (e.g., observed response times to interruptions), add a limitations subsection addressing the lack of baselines and ablations, and indicate directions for future comparative evaluations. revision: partial
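
If timestamped logs of the demonstrations exist, the response times mentioned above could be computed along roughly these lines; the log schema (event names, seconds from the start of the demo) is assumed here, not taken from the paper.

    # Hypothetical: interruption response time from a timestamped event log.
    # Both the event names and the numbers below are invented for illustration.

    def response_times(log: list[tuple[float, str]]) -> list[float]:
        """Seconds from each user interruption to the first subsequent replanned motion."""
        times, pending = [], None
        for t, event in log:
            if event == "user_interruption":
                pending = t
            elif event == "replanned_motion_started" and pending is not None:
                times.append(t - pending)
                pending = None
        return times

    demo_log = [(0.0, "task_started"), (12.5, "user_interruption"),
                (14.0, "replanned_motion_started"), (30.0, "task_completed")]
    print(response_times(demo_log))    # -> [1.5], i.e. 1.5 s from interruption to replanned motion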

Circularity Check

0 steps flagged

No circularity: framework proposal and questionnaire evaluation are independent of self-referential inputs

full rationale

The paper describes the ExpressMM framework as an integration of a VLM-based high-level planner for perception/reasoning with a low-level VLA policy, plus support for interruptible interactions, then evaluates it via live demonstrations and audience questionnaires on interpretability, social appropriateness, usefulness, safety, and predictability. No equations, parameter fitting, or derivations appear in the provided text. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The central claims rest on external demonstration results rather than reducing by construction to the framework definition itself. This is a standard descriptive robotics contribution with independent empirical support.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The framework depends on assumptions about the reliability of vision-language models for real-time perception and reasoning, plus the alignment of low-level policies with high-level plans. No new entities are invented, but many free parameters exist in the underlying pre-trained models.

free parameters (1)
  • VLM and VLA model parameters
    Pre-trained or fine-tuned vision-language models contain millions of fitted parameters from large datasets that control perception, reasoning, and action generation.
axioms (2)
  • domain assumption Vision-language models can accurately perceive scenes and perform conversational reasoning for task planning in dynamic HRI settings.
    Invoked for the high-level planner component.
  • domain assumption Low-level vision-language-action policies can generate expressive behaviors that communicate intent while completing manipulation tasks.
    Core assumption for the low-level policy integration.

pith-pipeline@v0.9.0 · 5545 in / 1464 out tokens · 42196 ms · 2026-05-10T19:50:29.309160+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

19 extracted references · 6 canonical work pages · 3 internal anchors

  1. [1]

    Mobile manipulation through an assistive home robot,

    M. Ciocarlie, K. Hsiao, A. Leeper, and D. Gossow, “Mobile manipulation through an assistive home robot,” in 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, Oct. 2012, pp. 5313–5320

  2. [2]

    Development of a tele-nursing mobile manipulator for remote care-giving in quarantine areas,

    Z. Li, P. Moran, Q. Dong, R. J. Shaw, and K. Hauser, “Development of a tele-nursing mobile manipulator for remote care-giving in quarantine areas,” in 2017 IEEE International Conference on Robotics and Automation (ICRA), May 2017, pp. 3581–3586

  3. [3]

    Smart Help: Strategic Opponent Modeling for Proactive and Adaptive Robot Assistance in Households,

    Z. Cao, Z. Wang, S. Xie, A. Liu, and L. Fan, “Smart Help: Strategic Opponent Modeling for Proactive and Adaptive Robot Assistance in Households,” in 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Jun. 2024, pp. 18091–18101

  4. [4]

    Shaping human-robot interaction: understanding the social aspects of intelligent robotic products,

    C. Bartneck and J. Forlizzi, “Shaping human-robot interaction: understanding the social aspects of intelligent robotic products,” in CHI ’04 Extended Abstracts on Human Factors in Computing Systems, in CHI EA ’04. New York, NY, USA: Association for Computing Machinery, Apr. 2004, pp. 1731–1732

  5. [5]

    ELEGNT: Expressive and Functional Movement Design for Non-anthropomorphic Robot,

    Y. Hu, P. Huang, M. Sivapurapu, and J. Zhang, “ELEGNT: Expressive and Functional Movement Design for Non-anthropomorphic Robot,” Jan. 21, 2025, arXiv: arXiv:2501.12493

  6. [6]

    MoveAE: Modifying Affective Robot Movements Using Classifying Variational Autoencoders,

    M. Suguitan, R. Gomez, and G. Hoffman, “MoveAE: Modifying Affective Robot Movements Using Classifying Variational Autoencoders,” in 2020 15th ACM/IEEE International Conference on Human-Robot Interaction (HRI), Mar. 2020, pp. 481–489

  7. [7]

    Cost Functions for Robot Motion Style,

    A. Zhou and A. D. Dragan, “Cost Functions for Robot Motion Style,” in 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Madrid: IEEE, Oct. 2018, pp. 3632–3639

  8. [8]

    Teaching Robots to Span the Space of Functional Expressive Motion,

    A. Sripathy, A. Bobu, Z. Li, K. Sreenath, D. S. Brown, and A. D. Dragan, “Teaching Robots to Span the Space of Functional Expressive Motion,” Aug. 02, 2022, arXiv: arXiv:2203.02091

  9. [9]

    Generative Expressive Robot Behaviors using Large Language Models,

    K. Mahadevan et al., “Generative Expressive Robot Behaviors using Large Language Models,” in Proceedings of the 2024 ACM/IEEE International Conference on Human-Robot Interaction, Boulder CO USA: ACM, Mar. 2024, pp. 482–491

  10. [10]

    No, to the Right: Online Language Corrections for Robotic Manipulation via Shared Autonomy,

    Y. Cui, S. Karamcheti, R. Palleti, N. Shivakumar, P. Liang, and D. Sadigh, “No, to the Right: Online Language Corrections for Robotic Manipulation via Shared Autonomy,” in Proceedings of the 2023 ACM/IEEE International Conference on Human-Robot Interaction, Stockholm Sweden: ACM, Mar. 2023, pp. 93–101

  11. [11]

    Reactive Task and Motion Planning under Temporal Logic Specifications,

    S. Li, D. Park, Y. Sung, J. A. Shah, and N. Roy, “Reactive Task and Motion Planning under Temporal Logic Specifications,” in 2021 IEEE International Conference on Robotics and Automation (ICRA), May 2021, pp. 12618–12624

  12. [12]

    Hi robot: Open-ended instruction following with hierarchical vision-language-action models,

    L. X. Shi et al., “Hi Robot: Open-Ended Instruction Following with Hierarchical Vision-Language-Action Models,” Feb. 26, 2025, arXiv: arXiv:2502.19417

  13. [13]

    GPT-4 Technical Report

    OpenAI et al., “GPT-4 Technical Report,” Mar. 04, 2024, arXiv: arXiv:2303.08774

  14. [14]

    DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter

    V. Sanh, L. Debut, J. Chaumond, and T. Wolf, “DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter,” Mar. 01, 2020, arXiv: arXiv:1910.01108

  15. [15]

    Language Models are Few-Shot Learners

    T. B. Brown et al., “Language Models are Few-Shot Learners,” Jul. 22, 2020, arXiv: arXiv:2005.14165

  16. [16]

    A Note on Two Problems in Connexion with Graphs,

    E. W. Dijkstra, “A Note on Two Problems in Connexion with Graphs,” in Edsger Wybe Dijkstra: His Life, Work, and Legacy, 1st ed., vol. 45, New York, NY, USA: Association for Computing Machinery, 2022, pp. 287–290

  17. [17]

    OpenAI GPT-5 System Card,

    A. Singh et al., “OpenAI GPT-5 System Card,” 2026, arXiv

  18. [18]

    | 2021 Science Report.” Accessed: Mar. 28,

  19. [19]

    The Robot Who Tried Too Hard: Social Behaviour of a Robot Tutor Can Negatively Affect Child Learning,

    J. Kennedy, P. Baxter, and T. Belpaeme, “The Robot Who Tried Too Hard: Social Behaviour of a Robot Tutor Can Negatively Affect Child Learning,” in Proceedings of the Tenth Annual ACM/IEEE International Conference on Human-Robot Interaction, Portland Oregon USA: ACM, Mar. 2015, pp. 67–74