
arxiv: 2607.00530 · v1 · pith:BB5H5OOLnew · submitted 2026-07-01 · 💻 cs.RO · cs.AI

From Technical Metrics to User Perception: A User Study of a Multimodal Human-Robot Interaction System for Object Detection and Grasping

Pith reviewed 2026-07-02 11:47 UTC · model grok-4.3

classification 💻 cs.RO cs.AI
keywords human-robot interaction · user perception · object grasping · multimodal system · user study · object detection

The pith

A 15-percentage-point gain in robot grasping success produces measurable improvements in user ratings of speed, reliability, and competence.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether a technical upgrade that lifts end-to-end task success from 75 percent to 90 percent creates differences that users can actually detect in live interaction. Researchers kept the motion controller unchanged and swapped only the perception and language modules, then asked 24 participants to perform the same tabletop grasping task with each version in a within-subject design. After each session participants rated perceived speed, reliability, and overall competence on Likert scales and stated a preference. Seventeen of the twenty-four participants favored the improved system, and all three rating scales showed statistically significant advantages for the upgraded configuration.

Core claim

Replacing the perception and language modules to raise end-to-end success from 75 percent to 90 percent produced a statistically significant user preference for the improved system (17 of 24 participants) together with large-effect-size gains on perceived speed, reliability, and competence/fluency after correction for multiple comparisons.

What carries the argument

Within-subject comparison of two HRI configurations that differ only in open-vocabulary object detection and action-extraction modules, evaluated through post-interaction 7-point Likert ratings and forced-choice preference.

If this is right

  • A 15-percentage-point technical gain in end-to-end success crosses the threshold of user perceptibility in direct interaction.
  • Ablation-identified module replacements can be validated as user-visible through controlled preference and rating data.
  • Benchmark improvements of this size warrant user-centred evaluation to confirm they affect experience.
  • The same within-subject protocol can be applied to other manipulation pipelines to test perceptibility of their technical changes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The result suggests a practical lower bound on the size of technical gain needed before users notice changes in grasping systems.
  • Similar studies could map the minimum perceptible difference by testing smaller or larger success-rate gaps.
  • The approach may extend to tasks that involve longer sequences or different sensing modalities.

Load-bearing premise

The only material difference users experience between the two systems is the change in perception and language modules rather than any unmeasured interaction with the shared controller or order effects.

What would settle it

A larger replication study that finds no significant preference or rating difference between the two configurations would falsify the claim that the technical gain is perceptible.

Figures

Figures reproduced from arXiv: 2607.00530 by Jian Song, Shen Guanting, Tian Zi.

Figure 1: Information flow of the complete system for both pipelines. Hardware elements and data flow of the robotic manipulation pipeline. Figure from [20].

Original abstract

Improvements in the technical performance of human-robot interaction (HRI) systems do not automatically translate into differences that human users can detect during live interaction. This paper investigates whether a 15 percentage point gain in end-to-end task success (from 75% in a multimodal baseline system to 90% in an improved configuration identified through a prior ablation study) is sufficient to produce consistent and measurable differences in user perception. The baseline system combines Whisper for speech recognition, Florence-2 for open-vocabulary object detection, LLaMA 3.1 for action extraction, and an interval Type-2 fuzzy logic controller for motion execution. The improved configuration replaces the perception and language modules with Grounding DINO + SAM and Qwen 3.5 9B, respectively, while retaining the same controller. A within-subject user study with 24 participants compared both systems on the same tabletop object-grasping task. After interacting with each configuration, participants rated perceived speed, reliability, and overall competence and fluency on a 7-point Likert scale. Results show that 17 out of 24 participants (70.83%) preferred the improved system (exact binomial test, p = 0.043, h = 0.43), and all three perceptual constructs were rated significantly higher for the improved configuration after Holm correction, with large to very large effect sizes (p < 0.001). These findings confirm that the identified technical improvements are perceptible to users in direct interaction and underscore the importance of complementing benchmark evaluation with user-centred evidence when assessing robotic manipulation pipelines.
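The preference result in the abstract is an exact binomial test on 17 of 24 choices against a chance level of 0.5. The sketch below shows how such a test can be recomputed from the counts alone. It is not the authors' analysis script: the function names are hypothetical, the null proportion of 0.5 is an assumption, and the abstract does not say whether the test was one-sided or two-sided, so both tails are printed and should be compared against the reported p = 0.043 rather than assumed to match it.

```python
from math import comb


def binomial_upper_tail(successes: int, n: int, p0: float = 0.5) -> float:
    """P(X >= successes) for X ~ Binomial(n, p0): the one-sided exact p-value."""
    return sum(
        comb(n, k) * p0**k * (1 - p0) ** (n - k) for k in range(successes, n + 1)
    )


def binomial_two_sided(successes: int, n: int, p0: float = 0.5) -> float:
    """Two-sided exact p-value: total probability of all outcomes that are
    no more likely under the null than the observed count."""
    pmf = [comb(n, k) * p0**k * (1 - p0) ** (n - k) for k in range(n + 1)]
    observed = pmf[successes]
    # Small relative tolerance so ties with the observed outcome are included.
    return min(1.0, sum(p for p in pmf if p <= observed * (1 + 1e-9)))


# Counts taken from the abstract; the chance level of 0.5 is an assumption.
preferred, participants = 17, 24
print(f"share preferring improved system: {preferred / participants:.4f}")
print(f"one-sided exact p: {binomial_upper_tail(preferred, participants):.4f}")
print(f"two-sided exact p: {binomial_two_sided(preferred, participants):.4f}")
```

Only the standard library is used so that the arithmetic stays visible; a statistics package would normally be preferred in a real analysis.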

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that a 15 percentage-point gain in end-to-end task success (75% to 90%) obtained by replacing the perception (Florence-2 to Grounding DINO + SAM) and language (LLaMA 3.1 to Qwen 3.5 9B) modules of a multimodal HRI grasping system produces measurable differences in user perception. A within-subject study with 24 participants found that 17/24 preferred the improved system (exact binomial p=0.043, h=0.43) and that all three Likert constructs (perceived speed, reliability, competence/fluency) were rated significantly higher after Holm correction (p<0.001, large-to-very-large effects).

Significance. If the causal attribution holds, the result supplies direct empirical evidence that technical ablation gains in robotic manipulation pipelines are perceptible to users during live interaction, thereby justifying the routine inclusion of user-centred evaluation alongside benchmark metrics. The use of exact binomial tests, Holm correction, and effect-size reporting is a methodological strength.
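For readers unfamiliar with the Holm correction mentioned above, the following is a minimal sketch of the step-down procedure applied to three tests, one per perceptual construct. The raw p-values are hypothetical placeholders (the paper reports only p < 0.001 after correction), and `holm_adjust` is an illustrative name, not a function from the paper.

```python
def holm_adjust(p_values: list[float]) -> list[float]:
    """Return Holm step-down adjusted p-values in the original order."""
    m = len(p_values)
    # Visit the tests from the smallest to the largest raw p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The k-th smallest p-value is multiplied by (m - k + 1).
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity: an adjusted value never drops below an earlier one.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Hypothetical raw p-values for the three constructs, for illustration only.
raw = {"speed": 0.0004, "reliability": 0.0001, "competence_fluency": 0.0002}
for name, p_adj in zip(raw, holm_adjust(list(raw.values()))):
    print(f"{name}: adjusted p = {p_adj:.4f}")
```

A construct is declared significant at level alpha when its adjusted p-value is at most alpha, which is equivalent to the usual sequential rejection rule.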

major comments (2)
  1. [Methods] Methods section (within-subject design paragraph): the abstract states that participants interacted with each configuration but provides no information on whether system order was counterbalanced or randomized. Without this detail, order effects (learning, anchoring, or fatigue) cannot be ruled out as alternative explanations for the 70.83% preference and the Likert differences, directly threatening the claim that the observed effects are attributable to the 15 pp success-rate gain from the changed modules.
  2. [Methods] Methods section (participant and procedure subsections): no sample-size justification, a priori power analysis, or discussion of individual-difference controls is referenced. With N=24 and a within-subject design, these omissions leave open whether the study was adequately powered to detect the reported effects and whether the shared controller introduced unmeasured confounds.
minor comments (2)
  1. [Abstract] Abstract: the three perceptual constructs are listed as 'perceived speed, reliability, and overall competence and fluency' but the exact Likert items and their aggregation are not defined; this should be clarified for reproducibility.
  2. [Results] Results: the effect-size symbol 'h=0.43' is reported without stating that it is Cohen's h; adding this label would improve clarity.
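The effect size in the second minor comment can be checked directly, assuming it is Cohen's h for the observed preference share (17/24) against a chance level of 0.5. The function name below is illustrative, and the chance level is an assumption rather than something the abstract states.

```python
from math import asin, sqrt


def cohens_h(p1: float, p2: float) -> float:
    """Cohen's h: difference of arcsine-transformed proportions."""
    return 2 * asin(sqrt(p1)) - 2 * asin(sqrt(p2))


# Observed share from the abstract versus an assumed chance level of 0.5.
h = cohens_h(17 / 24, 0.5)
print(f"Cohen's h = {h:.2f}")  # prints "Cohen's h = 0.43"
```

Under these assumptions the value agrees with the h = 0.43 reported in the abstract, which supports reading the symbol as Cohen's h.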

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for highlighting these important methodological details. Both comments identify omissions in the current manuscript that we will address in revision. We provide point-by-point responses below.

  1. Referee: [Methods] Methods section (within-subject design paragraph): the abstract states that participants interacted with each configuration but provides no information on whether system order was counterbalanced or randomized. Without this detail, order effects (learning, anchoring, or fatigue) cannot be ruled out as alternative explanations for the 70.83% preference and the Likert differences, directly threatening the claim that the observed effects are attributable to the 15 pp success-rate gain from the changed modules.

    Authors: We agree that the absence of this information is a limitation of the current manuscript. The order of the two system configurations was in fact randomized across participants. We will revise the Methods section to state this explicitly and to describe the randomization procedure. revision: yes

  2. Referee: [Methods] Methods section (participant and procedure subsections): no sample-size justification, a priori power analysis, or discussion of individual-difference controls is referenced. With N=24 and a within-subject design, these omissions leave open whether the study was adequately powered to detect the reported effects and whether the shared controller introduced unmeasured confounds.

    Authors: We acknowledge that the manuscript currently lacks an explicit sample-size justification or a priori power analysis. In revision we will add a paragraph discussing the choice of N=24 with reference to comparable HRI user studies and will report a post-hoc power analysis for the observed effects. The within-subject design controls for many stable individual differences by having each participant experience both conditions; we will clarify this point and note any additional controls that were applied. revision: yes
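To make the power question concrete, here is a rough sketch of an exact power calculation for the forced-choice preference test with N = 24. It assumes a one-sided exact binomial test at alpha = 0.05 against a chance level of 0.5, and it treats the observed share of 17/24 as the true preference rate, which is the usual (and often criticised) post-hoc convention. None of these choices are confirmed by the paper, and the function names are hypothetical.

```python
from math import comb


def upper_tail(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k, n + 1))


def critical_count(n: int, alpha: float = 0.05, p0: float = 0.5) -> int:
    """Smallest count whose upper-tail probability under the null is <= alpha."""
    for k in range(n + 1):
        if upper_tail(k, n, p0) <= alpha:
            return k
    return n + 1  # no count is extreme enough to reject


def exact_power(n: int, p_true: float, alpha: float = 0.05, p0: float = 0.5) -> float:
    """Probability of reaching the critical count when the true rate is p_true."""
    return upper_tail(critical_count(n, alpha, p0), n, p_true)


n = 24
print(f"critical count at alpha = 0.05: {critical_count(n)} of {n}")
print(f"power if the true preference rate is 17/24: {exact_power(n, 17 / 24):.3f}")
```

The same functions can be looped over larger values of `n` to see how many participants a replication would need to reach a target power for a given assumed preference rate.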

Circularity Check

0 steps flagged

No circularity: empirical user study with independent participant data


The paper reports results from a within-subject user study measuring participant preferences and Likert ratings for two HRI configurations. All load-bearing claims (70.83% preference, p=0.043; significant differences on perceptual constructs with p<0.001 after correction) rest on direct statistical analysis of observed responses rather than any equation, fitted parameter, or derivation that reduces to its own inputs. The prior ablation study is cited only to identify the improved configuration; the user-perception findings are measured independently and do not rely on self-citation chains or self-definitional constructs. This is a standard empirical design with no mathematical circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The paper rests on standard assumptions of user-study methodology and non-parametric or corrected parametric tests; no free parameters, invented entities, or ad-hoc axioms are introduced beyond those implicit in Likert-scale analysis and binomial testing.

axioms (2)
  • domain assumption: Participant ratings on 7-point Likert scales can be treated as ordinal data suitable for non-parametric or corrected parametric comparison after Holm adjustment.
    Invoked when reporting significant differences and effect sizes on the three perceptual constructs.
  • domain assumption: The within-subject design with fixed task order or counterbalancing sufficiently isolates system differences from learning or fatigue effects.
    Required for attributing preference and rating differences to the system change rather than presentation order.
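One common way to satisfy the second assumption is to counterbalance presentation order so that half of the participants see each system first. The sketch below illustrates that idea only; the paper's actual assignment procedure is not described in the material above, and the function name, condition labels, and seed are hypothetical.

```python
import random


def assign_orders(participant_ids: list[int], seed: int = 0) -> dict[int, tuple[str, str]]:
    """Assign each participant a system order, balanced across the two orders."""
    rng = random.Random(seed)  # fixed seed so the schedule is reproducible
    orders = [("baseline", "improved"), ("improved", "baseline")]
    # Alternate the two orders to get an equal split, then shuffle who gets which.
    schedule = [orders[i % 2] for i in range(len(participant_ids))]
    rng.shuffle(schedule)
    return dict(zip(participant_ids, schedule))


assignment = assign_orders(list(range(1, 25)), seed=42)
baseline_first = sum(1 for order in assignment.values() if order[0] == "baseline")
print(f"baseline first: {baseline_first}, improved first: {len(assignment) - baseline_first}")
```

With an even number of participants this yields an exact 12/12 split, unlike independent coin flips per participant, which randomise order but do not guarantee balance.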



Reference graph

Works this paper leans on

28 extracted references · 6 canonical work pages · 2 internal anchors

  1. [1]

    An advanced medical robotic system augmenting healthcare capabilities-robotic nursing assistant,

    J. Hu, A. Edsinger, Y.-J. Lim, N. Donaldson, M. Solano, A. Solochek, and R. Marchessault, “An advanced medical robotic system augmenting healthcare capabilities-robotic nursing assistant,” in 2011 IEEE International Conference on Robotics and Automation. IEEE, 2011, pp. 6264–6269

  2. [2]

    A human-robot interaction application based on augmented reality (ar) for industrial robot grasping process,

    L. Zhao, Z. Hu, H. Ding, S. Ji, and J. Yan, “A human-robot interaction application based on augmented reality (ar) for industrial robot grasping process,” in 2022 7th International Conference on Robotics and Automation Engineering (ICRAE). IEEE, 2022, pp. 312–316

  3. [3]

    An educational robot system of visual question answering for preschoolers,

    B. He, M. Xia, X. Yu, P. Jian, H. Meng, and Z. Chen, “An educational robot system of visual question answering for preschoolers,” in 2017 2nd International Conference on Robotics and Automation Engineering (ICRAE). IEEE, 2017, pp. 441–445

  4. [4]

    Home robot service by ceiling ultrasonic locator and microphone array,

    S. Kagami, S. Thompson, Y. Nishida, T. Enomoto, and T. Matsui, “Home robot service by ceiling ultrasonic locator and microphone array,” in Proceedings 2006 IEEE International Conference on Robotics and Automation, 2006. ICRA 2006. IEEE, 2006, pp. 3171–3176

  5. [5]

    Anticipatory robot control for efficient human-robot collaboration,

    C.-M. Huang and B. Mutlu, “Anticipatory robot control for efficient human-robot collaboration,” in 2016 11th ACM/IEEE International Conference on Human-Robot Interaction (HRI), 2016, pp. 83–90

  6. [6]

    Pointing gestures for human-robot interaction with the humanoid robot digit,

    V. Lorentz, M. Weiss, K. Hildebrand, and I. Boblan, “Pointing gestures for human-robot interaction with the humanoid robot digit,” in 2023 32nd IEEE International Conference on Robot and Human Interactive Communication (RO-MAN). IEEE, 2023, pp. 1886–1892

  7. [7]

    The human intention: a taxonomy attempt and its applications to robotics,

    J. E. Domínguez-Vidal and A. Sanfeliu, “The human intention: a taxonomy attempt and its applications to robotics,” International Journal of Social Robotics, vol. 17, no. 11, pp. 2479–2499, 2025

  8. [8]

    Autonomous laparoscopic robotic suturing with a novel actuated suturing tool and 3d endoscope,

    H. Saeidi, H. N. Le, J. D. Opfermann, S. Léonard, A. Kim, M. H. Hsieh, J. U. Kang, and A. Krieger, “Autonomous laparoscopic robotic suturing with a novel actuated suturing tool and 3d endoscope,” in 2019 International Conference on Robotics and Automation (ICRA). IEEE, 2019, pp. 1541–1547

  9. [9]

    Improving human-robot interaction effectiveness in human-robot collaborative object transportation using force prediction,

    J. E. Domínguez-Vidal and A. Sanfeliu, “Improving human-robot interaction effectiveness in human-robot collaborative object transportation using force prediction,” in 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2023, pp. 7839–7845

  10. [10]

    Exploring transformers and visual transformers for force prediction in human-robot collaborative transportation tasks,

    J. E. Dominguez-Vidal and A. Sanfeliu, “Exploring transformers and visual transformers for force prediction in human-robot collaborative transportation tasks,” in 2024 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2024, pp. 3191–3197

  11. [11]

    Force and velocity prediction in human-robot collaborative transportation tasks through video retentive networks,

    J. E. Domínguez-Vidal and A. Sanfeliu, “Force and velocity prediction in human-robot collaborative transportation tasks through video retentive networks,” in 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2024, pp. 9307–9313

  12. [12]

    LLaMA: Open and Efficient Foundation Language Models

    H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar et al., “Llama: Open and efficient foundation language models,” arXiv preprint arXiv:2302.13971, 2023

  13. [13]

    GPT-4 Technical Report

    J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat et al., “Gpt-4 technical report,” arXiv preprint arXiv:2303.08774, 2023

  14. [14]

    Language and sketching: An llm-driven interactive multimodal multitask robot navigation framework,

    W. Zu, W. Song, R. Chen, Z. Guo, F. Sun, Z. Tian, W. Pan, and J. Wang, “Language and sketching: An llm-driven interactive multimodal multitask robot navigation framework,” in 2024 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2024, pp. 1019–1025

  15. [15]

    Interactive navigation in environments with traversable obstacles using large language and vision-language models,

    Z. Zhang, A. Lin, C. W. Wong, X. Chu, Q. Dou, and K. S. Au, “Interactive navigation in environments with traversable obstacles using large language and vision-language models,” in 2024 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2024, pp. 7867–7873

  16. [16]

    Physically grounded vision-language models for robotic manipulation,

    J. Gao, B. Sarkar, F. Xia, T. Xiao, J. Wu, B. Ichter, A. Majumdar, and D. Sadigh, “Physically grounded vision-language models for robotic manipulation,” in 2024 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2024, pp. 12462–12469

  17. [17]

    When the inference meets the explicitness or why multimodality can make us forget about the perfect predictor,

    J. E. Domínguez-Vidal and A. Sanfeliu, “When the inference meets the explicitness or why multimodality can make us forget about the perfect predictor,” International Journal of Social Robotics, vol. 17, no. 12, pp. 2965–2980, 2025

  18. [18]

    Anticipation and proactivity. unraveling both concepts in human-robot interaction through a handover example,

    J. E. Dominguez-Vidal and A. Sanfeliu, “Anticipation and proactivity. unraveling both concepts in human-robot interaction through a handover example,” in 2024 33rd IEEE International Conference on Robot and Human Interactive Communication (ROMAN). IEEE, 2024, pp. 957–962

  19. [19]

    Leveraging large language models in human-robot interaction: a critical analysis of potential and pitfalls,

    J. Atuhurra, “Leveraging large language models in human-robot interaction: a critical analysis of potential and pitfalls,” arXiv preprint arXiv:2405.00693, 2024

  20. [20]

    An approach to combining video and speech with large language models in human-robot interaction,

    G. Shen and Z. Tian, “An approach to combining video and speech with large language models in human-robot interaction,” arXiv preprint arXiv:2602.20219, 2026

  21. [21]

    Ast: Audio spectrogram transformer,

    Y. Gong, Y.-A. Chung, and J. Glass, “Ast: Audio spectrogram transformer,” arXiv preprint arXiv:2104.01778, 2021

  22. [22]

    Robust speech recognition via large-scale weak supervision,

    A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, “Robust speech recognition via large-scale weak supervision,” in International Conference on Machine Learning. PMLR, 2023, pp. 28492–28518

  23. [23]

    Florence-2: Advancing a unified representation for a variety of vision tasks,

    B. Xiao, H. Wu, W. Xu, X. Dai, H. Hu, Y. Lu, M. Zeng, C. Liu, and L. Yuan, “Florence-2: Advancing a unified representation for a variety of vision tasks,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 4818–4829

  24. [24]

    A ros 2 wrapper for florence-2: Multi-mode local vision-language inference for robotic systems,

    J. E. Domínguez-Vidal, “A ros 2 wrapper for florence-2: Multi-mode local vision-language inference for robotic systems,” arXiv preprint arXiv:2604.01179, 2026

  25. [25]

    Fuzzy logic systems for engineering: a tutorial,

    J. M. Mendel, “Fuzzy logic systems for engineering: a tutorial,” Proceedings of the IEEE, vol. 83, no. 3, pp. 345–377, 2002

  26. [26]

    Interval type-2 fuzzy logic systems: theory and design,

    Q. Liang and J. M. Mendel, “Interval type-2 fuzzy logic systems: theory and design,” IEEE Transactions on Fuzzy Systems, vol. 8, no. 5, pp. 535–550, 2000

  27. [27]

    Fuzzy logic introduction,

    M. Hellmann, “Fuzzy logic introduction,” Université de Rennes, vol. 1, no. 1, 2001

  28. [28]

    A type-2 fuzzy logic controller for autonomous mobile robots,

    H. Hagras, “A type-2 fuzzy logic controller for autonomous mobile robots,” in 2004 IEEE International Conference on Fuzzy Systems (IEEE Cat. No. 04CH37542), vol. 2. IEEE, 2004, pp. 965–970