Recognition: no theorem link
ExpressMM: Expressive Mobile Manipulation Behaviors in Human-Robot Interactions
Pith reviewed 2026-05-10 19:50 UTC · model grok-4.3
The pith
ExpressMM combines a vision-language planner with a low-level action policy to let mobile manipulators produce expressive behaviors that communicate intent during collaborative tasks with humans, including when users interrupt or redirect.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that the ExpressMM framework integrates a high-level language-guided planner, based on a vision-language model for perception and conversational reasoning, with a low-level vision-language-action policy to generate expressive robot behaviors during collaborative HRI tasks. The framework also supports interruptible interactions, accommodating updated or redirecting instructions from users. In a demonstrated mobile-manipulator assembly scenario, audience evaluations showed that these behaviors helped observers clearly interpret the robot's actions and intentions, supported socially appropriate and understandable interactions, and led participants to view the robot as useful, predictable, and safe.
What carries the argument
The ExpressMM framework, which links a high-level vision-language model planner for scene understanding and reasoning to a low-level vision-language-action policy that produces task-aligned expressive motions.
If this is right
- Expressive behaviors generated by the framework allow observers to clearly interpret the robot's actions and intentions.
- The approach supports socially appropriate and understandable interactions in collaborative human-robot settings.
- Participants perceive the robot as useful for collaborative tasks and as behaving in a predictable and safe manner.
- Interruptible interactions enable the robot to accommodate updated or redirecting instructions from users during task execution.
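The paper (as summarized here) gives no implementation detail, but the claimed structure — a high-level planner that decomposes instructions and a low-level policy that executes them, with replanning on user interruption — can be sketched as a simple control loop. Everything below (the `MockPlanner`/`MockPolicy` classes, the `plan`/`act` interfaces, the interruption queue) is a hypothetical illustration of the idea, not ExpressMM's actual API:

```python
import queue
from dataclasses import dataclass


@dataclass
class Subtask:
    description: str  # language instruction passed to the low-level policy


class MockPlanner:
    """Stands in for the high-level VLM planner (hypothetical interface)."""

    def plan(self, instruction: str) -> list:
        # A real VLM planner would ground the instruction in the observed
        # scene; here we just split a comma-separated instruction.
        return [Subtask(s.strip()) for s in instruction.split(",")]


class MockPolicy:
    """Stands in for the low-level VLA policy (hypothetical interface)."""

    def act(self, subtask: Subtask) -> str:
        return f"executed: {subtask.description}"


def run(planner, policy, instruction: str, interruptions: queue.Queue) -> list:
    """Execute subtasks in order, replanning whenever an interruption arrives."""
    log = []
    subtasks = planner.plan(instruction)
    while subtasks:
        # Check for a redirecting instruction before each subtask.
        try:
            new_instruction = interruptions.get_nowait()
        except queue.Empty:
            log.append(policy.act(subtasks.pop(0)))
        else:
            subtasks = planner.plan(new_instruction)  # replan from scratch
            log.append(f"replanned: {new_instruction}")
    return log


interrupts = queue.Queue()
interrupts.put("fetch the red block")
print(run(MockPlanner(), MockPolicy(), "pick screw, hand over screw", interrupts))
# → ['replanned: fetch the red block', 'executed: fetch the red block']
```

Polling only between subtasks is the simplest choice; a system that must abort a motion mid-execution would need preemption inside the policy as well, which is exactly the untested assumption the referee report below flags.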
Where Pith is reading between the lines
- The same planner-plus-policy structure could reduce reliance on pre-programmed motions when deploying robots in varied human environments.
- Clear intent communication during interruptions might lower the cognitive load on human collaborators over repeated interactions.
Load-bearing premise
The vision-language model planner combined with the action policy will reliably generate expressive behaviors that match task goals and adapt to human interruptions without producing unsafe motions or misinterpretations in real settings.
What would settle it
A live test in which a human redirects the robot mid-task: the claim would fail if the system produces motions that observers cannot interpret, or generates unsafe movements that contradict the reported safety and predictability ratings.
Original abstract
Mobile manipulators are increasingly deployed in human-centered environments to perform tasks. While completing such tasks, they should also be able to communicate their intent to the people around them using expressive robot behaviors. Prior work on expressive robot behaviors has used preprogrammed or learning-from-demonstration-based expressive motions and large language model generated high-level interactions. The majority of these existing approaches have not considered human-robot interactions (HRI) where users may interrupt, modify, or redirect a robot's actions during task execution. In this paper, we develop the novel ExpressMM framework that integrates a high-level language-guided planner based on a vision-language model for perception and conversational reasoning with a low-level vision-language-action policy to generate expressive robot behaviors during collaborative HRI tasks. Furthermore, ExpressMM supports interruptible interactions to accommodate updated or redirecting instructions by users. We demonstrate ExpressMM on a mobile manipulator assisting a human in a collaborative assembly scenario and conduct audience-based evaluation of live HRI demonstrations. Questionnaire results show that the ExpressMM-enabled expressive behaviors helped observers clearly interpret the robot's actions and intentions while supporting socially appropriate and understandable interactions. Participants also reported that the robot was useful for collaborative tasks and behaved in a predictable and safe manner during the demonstrations, fostering positive perceptions of the robot's usefulness, safety, and predictability during the collaborative tasks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces the ExpressMM framework for mobile manipulators in collaborative human-robot interaction (HRI). It combines a high-level language-guided planner based on a vision-language model (VLM) for perception and conversational reasoning with a low-level vision-language-action (VLA) policy to produce expressive robot behaviors. The framework is designed to support interruptible interactions where users can redirect the robot mid-task. The approach is demonstrated in a collaborative assembly scenario on a mobile manipulator, with evaluation consisting of live demonstrations followed by audience-based questionnaires assessing intent interpretation, social appropriateness, usefulness, safety, and predictability.
Significance. If the VLM-VLA integration reliably generates expressive, goal-aligned behaviors that adapt to interruptions without perception errors or unsafe motions, the work would contribute to HRI by moving beyond preprogrammed or LfD-based expressiveness toward language-driven, interruptible mobile manipulation. The live-demonstration setup and positive questionnaire feedback provide initial evidence of improved observer understanding. However, the absence of quantitative metrics, baselines, or active-collaborator testing limits the strength of claims about real-time adaptation and safety.
major comments (2)
- [Experiments] The central claim that ExpressMM supports interruptible interactions rests on audience questionnaires from passive observation of demonstrations. No data or protocol is provided for active users issuing redirecting instructions mid-task, perception errors during interruptions, or physical safety in collaborative settings, leaving the weakest assumption (reliable VLM+VLA integration without errors or unsafe motions) untested.
- [Abstract and Experiments] No quantitative metrics (e.g., success rates, latency, error rates), baseline comparisons (e.g., against non-expressive or non-interruptible policies), or ablation studies on the VLM planner vs. VLA policy are reported. This makes it impossible to verify the integration's performance or isolate the contribution of expressiveness.
minor comments (2)
- [Abstract] The abstract states that 'questionnaire results show...' but provides no sample size, question wording, or statistical analysis; adding these details would improve reproducibility.
- [Framework] Framework description would benefit from a diagram or pseudocode showing the exact data flow between the high-level VLM planner and low-level VLA policy, including how interruptions are parsed and executed.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment below and will revise the paper accordingly to clarify the scope of our evaluation and strengthen the presentation of limitations.
Point-by-point responses
Referee: [Experiments] The central claim that ExpressMM supports interruptible interactions rests on audience questionnaires from passive observation of demonstrations. No data or protocol is provided for active users issuing redirecting instructions mid-task, perception errors during interruptions, or physical safety in collaborative settings, leaving the weakest assumption (reliable VLM+VLA integration without errors or unsafe motions) untested.
Authors: We agree that the reported evaluation relies on audience questionnaires collected after live demonstrations rather than controlled experiments with active collaborators. The demonstrations illustrated interruptible behaviors, with the VLM planner handling redirecting instructions and the VLA policy executing the updated actions in real time. However, we did not include formal protocols, success rates for interruptions, or measurements of perception errors and physical safety. We will revise the Experiments section to describe the demonstration protocol in greater detail, explicitly state that the evaluation is perceptual and passive, and add a limitations paragraph outlining the need for future active-user studies with quantitative safety and error analysis. revision: yes
Referee: [Abstract and Experiments] No quantitative metrics (e.g., success rates, latency, error rates), baseline comparisons (e.g., against non-expressive or non-interruptible policies), or ablation studies on the VLM planner vs. VLA policy are reported. This makes it impossible to verify the integration's performance or isolate the contribution of expressiveness.
Authors: The manuscript prioritizes a framework demonstration and human-perception evaluation of expressiveness in HRI over performance benchmarking. We acknowledge that the absence of quantitative metrics, baselines, and ablations limits the ability to isolate component contributions or compare against alternatives. No systematic success rates or latency data were collected during the live demonstrations. We will revise the Experiments and Discussion sections to report any available observational data (e.g., observed response times to interruptions), add a limitations subsection addressing the lack of baselines and ablations, and indicate directions for future comparative evaluations. revision: partial
Circularity Check
No circularity: framework proposal and questionnaire evaluation are independent of self-referential inputs
Full rationale
The paper describes the ExpressMM framework as an integration of a VLM-based high-level planner for perception/reasoning with a low-level VLA policy, plus support for interruptible interactions, then evaluates it via live demonstrations and audience questionnaires on interpretability, social appropriateness, usefulness, safety, and predictability. No equations, parameter fitting, or derivations appear in the provided text. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The central claims rest on external demonstration results rather than reducing by construction to the framework definition itself. This is a standard descriptive robotics contribution with independent empirical support.
Axiom & Free-Parameter Ledger
free parameters (1)
- VLM and VLA model parameters
axioms (2)
- Domain assumption: Vision-language models can accurately perceive scenes and perform conversational reasoning for task planning in dynamic HRI settings.
- Domain assumption: Low-level vision-language-action policies can generate expressive behaviors that communicate intent while completing manipulation tasks.
Reference graph
Works this paper leans on
- [1] M. Ciocarlie, K. Hsiao, A. Leeper, and D. Gossow, “Mobile manipulation through an assistive home robot,” in 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, Oct. 2012, pp. 5313–5320.
- [2] Z. Li, P. Moran, Q. Dong, R. J. Shaw, and K. Hauser, “Development of a tele-nursing mobile manipulator for remote care-giving in quarantine areas,” in 2017 IEEE International Conference on Robotics and Automation (ICRA), May 2017, pp. 3581–3586.
- [3] Z. Cao, Z. Wang, S. Xie, A. Liu, and L. Fan, “Smart Help: Strategic Opponent Modeling for Proactive and Adaptive Robot Assistance in Households,” in 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Jun. 2024, pp. 18091–18101.
- [4] C. Bartneck and J. Forlizzi, “Shaping human-robot interaction: understanding the social aspects of intelligent robotic products,” in CHI ’04 Extended Abstracts on Human Factors in Computing Systems (CHI EA ’04), New York, NY, USA: Association for Computing Machinery, Apr. 2004, pp. 1731–1732.
- [5] Y. Hu, P. Huang, M. Sivapurapu, and J. Zhang, “ELEGNT: Expressive and Functional Movement Design for Non-anthropomorphic Robot,” Jan. 21, 2025, arXiv:2501.12493.
- [6] M. Suguitan, R. Gomez, and G. Hoffman, “MoveAE: Modifying Affective Robot Movements Using Classifying Variational Autoencoders,” in 2020 15th ACM/IEEE International Conference on Human-Robot Interaction (HRI), Mar. 2020, pp. 481–489.
- [7] A. Zhou and A. D. Dragan, “Cost Functions for Robot Motion Style,” in 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Madrid: IEEE, Oct. 2018, pp. 3632–3639.
- [8] A. Sripathy, A. Bobu, Z. Li, K. Sreenath, D. S. Brown, and A. D. Dragan, “Teaching Robots to Span the Space of Functional Expressive Motion,” Aug. 02, 2022, arXiv:2203.02091.
- [9] K. Mahadevan et al., “Generative Expressive Robot Behaviors using Large Language Models,” in Proceedings of the 2024 ACM/IEEE International Conference on Human-Robot Interaction, Boulder, CO, USA: ACM, Mar. 2024, pp. 482–491.
- [10] Y. Cui, S. Karamcheti, R. Palleti, N. Shivakumar, P. Liang, and D. Sadigh, “No, to the Right: Online Language Corrections for Robotic Manipulation via Shared Autonomy,” in Proceedings of the 2023 ACM/IEEE International Conference on Human-Robot Interaction, Stockholm, Sweden: ACM, Mar. 2023, pp. 93–101.
- [11] S. Li, D. Park, Y. Sung, J. A. Shah, and N. Roy, “Reactive Task and Motion Planning under Temporal Logic Specifications,” in 2021 IEEE International Conference on Robotics and Automation (ICRA), May 2021, pp. 12618–12624.
- [12] L. X. Shi et al., “Hi Robot: Open-Ended Instruction Following with Hierarchical Vision-Language-Action Models,” Feb. 26, 2025, arXiv:2502.19417.
- [13] OpenAI et al., “GPT-4 Technical Report,” Mar. 04, 2024, arXiv:2303.08774.
- [14] V. Sanh, L. Debut, J. Chaumond, and T. Wolf, “DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter,” Mar. 01, 2020, arXiv:1910.01108.
- [15] T. B. Brown et al., “Language Models are Few-Shot Learners,” Jul. 22, 2020, arXiv:2005.14165.
- [16] E. W. Dijkstra, “A Note on Two Problems in Connexion with Graphs,” in Edsger Wybe Dijkstra: His Life, Work, and Legacy, 1st ed., vol. 45, New York, NY, USA: Association for Computing Machinery, 2022, pp. 287–290.
- [17] A. Singh et al., “OpenAI GPT-5 System Card,” 2026, arXiv.
- [18] “2021 Science Report,” accessed Mar. 28, 2021.
- [19] J. Kennedy, P. Baxter, and T. Belpaeme, “The Robot Who Tried Too Hard: Social Behaviour of a Robot Tutor Can Negatively Affect Child Learning,” in Proceedings of the Tenth Annual ACM/IEEE International Conference on Human-Robot Interaction, Portland, OR, USA: ACM, Mar. 2015, pp. 67–74.