arxiv: 2606.23339 · v1 · pith:3GUIPCOAnew · submitted 2026-06-22 · 💻 cs.RO

When Robots Rate Their Own Interactions: Engagement Validity and the Strangeness Failure

Pith reviewed 2026-06-26 08:08 UTC · model grok-4.3

classification 💻 cs.RO
keywords human-robot interaction · LLM evaluation · inverted evaluation · engagement assessment · strangeness · HRI questionnaires

The pith

LLM-powered robots agree with humans on engagement ratings but systematically invert strangeness assessments.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests an inverted evaluation approach in which LLM-powered robots complete the same HRI questionnaires that humans use, but from the robot's own perspective. Ratings from five LLMs across 25 interactions matched human ground truth on satisfaction and enjoyment dimensions yet reversed the comfort and strangeness scores. The same inversion appeared when a physical Nao robot performed live, turn-by-turn assessments. The authors conclude that LLMs lack the internal affective states needed to judge strangeness and that supplementary sensor data will be required for reliable robot self-evaluation.

Core claim

LLMs achieve moderate-to-strong agreement with humans on engagement (satisfaction r up to .65, enjoyment r up to .72) with high test-retest reliability, but invert the comfort/strangeness dimension (r = -.44 to -.67). The pattern holds across models, synthetic controls, and embodied deployment.
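
To make the agreement-versus-inversion pattern concrete, here is a minimal sketch of the Pearson r comparison. The ratings are hypothetical 1-5 values invented for illustration, not data from the paper, and `pearson_r` is a helper name introduced here.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equal-length rating sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant ratings")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical 1-5 ratings for six interactions (NOT data from the paper).
human_enjoyment = [2, 3, 4, 4, 5, 1]
llm_enjoyment = [2, 3, 3, 5, 5, 2]
human_strangeness = [4, 3, 2, 2, 1, 5]
llm_strangeness = [2, 2, 3, 4, 4, 1]

r_enjoy = pearson_r(human_enjoyment, llm_enjoyment)
r_strange = pearson_r(human_strangeness, llm_strangeness)
print(f"enjoyment r = {r_enjoy:+.2f}")      # positive sign: agreement
print(f"strangeness r = {r_strange:+.2f}")  # negative sign: inversion
```

A positive r means the LLM ranks interactions in the same direction as the humans; a negative r means it tends to rate the interactions humans found strangest as the least strange.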

What carries the argument

Inverted evaluation, in which the LLM completes standardized questionnaires (HRI-CUES, Godspeed, RoSAS) from the robot's perspective for direct comparison to human ratings.

Load-bearing premise

The inversion on strangeness occurs because LLMs lack access to internal affective states rather than because of questionnaire wording or prompt artifacts.

What would settle it

Feeding the LLM physiological signals, gaze, or proxemics data during assessment and checking whether the strangeness correlation shifts from negative to positive would test the claim.

Figures

Figures reproduced from arXiv: 2606.23339 by Hasan Mahmud, Jamison Heard, Mohammad Javad Khojasteh, Prabu David, Victor Lockwood.

Figure 1. Self-Report Agreement: Pearson r between LLM ratings and human ground truth across five models and five HRI-CUES dimensions. Engagement dimensions and comfort reached significance (p < .05); quality did not.
Figure 2. Turn-level divergence (Robot − Human) on HRI-CUES items. Values near zero indicate agreement. Strangeness (pink) shows persistent negative divergence for P1 and P2. P4 used audio-only input (no camera).
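
The divergence measure in Figure 2 is a per-turn difference, robot rating minus human rating. A minimal sketch with hypothetical ratings (the values and the `turn_divergence` helper are illustrative assumptions, not taken from the paper's data files):

```python
def turn_divergence(robot, human):
    """Per-turn divergence (robot minus human); values near 0 indicate agreement."""
    if len(robot) != len(human):
        raise ValueError("robot and human must rate the same number of turns")
    return [r - h for r, h in zip(robot, human)]

# Hypothetical strangeness ratings over four turns (1-5 scale).
robot_strangeness = [1, 2, 1, 2]
human_strangeness = [4, 4, 4, 4]

div = turn_divergence(robot_strangeness, human_strangeness)
mean_div = sum(div) / len(div)
print(div, mean_div)
```

A mean that stays well below zero across turns corresponds to the persistent negative divergence the caption describes: the robot keeps rating the interaction as less strange than the participant does.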
Original abstract

Human-robot interaction (HRI) evaluation relies almost exclusively on human-completed questionnaires, leaving the robot's perspective unexamined. We propose an inverted evaluation, in which LLM-powered robots complete the same standardized instruments from their own perspective, and test whether these ratings agree with human ground truth. In Study 1, five LLMs completed HRI-CUES, Godspeed, and RoSAS questionnaires for 25 interactions (N = 1,522 evaluations) from the HRI-CUES dataset. LLMs achieved moderate-to-strong agreement on engagement dimensions (satisfaction r up to .65 and enjoyment r up to .72) with excellent test-retest reliability (ICC ≥ .82), but systematically inverted the comfort/strangeness dimension (r = -.44 to -.67, all p < .05), conflating engagement with comfort. In Study 2, a Nao robot running Claude Sonnet 4.5 replicated these patterns in live interactions (N = 4), including real-time turn-by-turn assessment. The strangeness failure persisted across five models, synthetic controls, and embodied deployment for two participants. We argue that current LLM-based robots lack access to the internal affective states needed to assess constructs like strangeness, and that inverted evaluation requires supplementary modalities (e.g., physiological signals, gaze, proxemics) to move beyond behavioral proxies. These findings establish boundary conditions for using LLMs as interaction evaluators in HRI.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes an 'inverted evaluation' method in which LLM-powered robots complete standard HRI questionnaires (HRI-CUES, Godspeed, RoSAS) from the robot's perspective and compares these ratings to human ground truth. Study 1 has five LLMs rate 25 interactions from the HRI-CUES dataset (1,522 total evaluations), finding moderate-to-strong agreement on engagement dimensions (satisfaction r up to .65, enjoyment r up to .72, ICC ≥ .82) but systematic inversion on comfort/strangeness (r = -.44 to -.67, all p < .05). Study 2 replicates the pattern in live Nao robot interactions (N=4) using Claude Sonnet 4.5 with real-time assessment. The authors conclude that LLMs lack access to internal affective states for constructs like strangeness and therefore require supplementary modalities (physiological signals, gaze, proxemics) beyond behavioral proxies.

Significance. If the inversion pattern is robust, the work supplies concrete boundary conditions for deploying LLMs as interaction evaluators in HRI. The consistency across five models plus the embodied live-robot replication constitutes a strength of the empirical design. The findings could usefully caution against sole reliance on LLM self-ratings for affective dimensions and motivate multimodal extensions, provided the causal attribution is clarified.

major comments (2)
  1. [Abstract (argument paragraph)] The claim that the negative correlations on comfort/strangeness reflect LLMs' lack of 'access to the internal affective states' (rather than questionnaire wording, prompt construction, or training-data artifacts) is invoked to justify the recommendation for supplementary modalities. No ablation is reported that isolates this factor (e.g., prompt variants supplying simulated internal-state signals, rephrased strangeness items, or non-LLM baselines also lacking embodiment). This interpretive step is load-bearing for the central boundary-condition argument.
  2. [Study 2] The live-robot replication is conducted with N=4, yet the abstract supplies no error bars, exclusion criteria, or full statistical tables for these trials. Given that the inversion pattern is asserted to persist 'across ... embodied deployment,' the small sample size and limited reporting undermine the strength of the embodied evidence relative to the larger Study 1 dataset.
minor comments (2)
  1. [Abstract] The statement that the strangeness failure 'persisted across five models, synthetic controls, and embodied deployment for two participants' does not define the synthetic controls or clarify why only two of the four participants are referenced for the embodied case.
  2. [Abstract] Correlation values are reported without accompanying sample sizes per dimension or confidence intervals, reducing transparency even though overall N=1,522 is stated.
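
To illustrate why the missing confidence intervals matter, here is a sketch of a 95% interval for a correlation via the Fisher z-transformation. It assumes the weakest reported strangeness correlation (r = -.44) was computed over the 25 interactions; the per-dimension sample size is not stated in the abstract, so n = 25 is an assumption, and `fisher_ci` is a name introduced here.

```python
import math

def fisher_ci(r, n, z_crit=1.959964):
    """Approximate 95% CI for a Pearson r using the Fisher z-transformation."""
    if not -1 < r < 1:
        raise ValueError("r must lie strictly between -1 and 1")
    if n < 4:
        raise ValueError("need n >= 4 for the standard error 1/sqrt(n - 3)")
    z = math.atanh(r)            # transform r to the z scale
    se = 1 / math.sqrt(n - 3)    # standard error on the z scale
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)

# Assumed sample size: n = 25 interactions (not confirmed by the abstract).
lo, hi = fisher_ci(-0.44, 25)
print(f"r = -0.44, n = 25: 95% CI = [{lo:.2f}, {hi:.2f}]")
```

Under that assumption the interval is wide and its upper end sits close to zero, which is the kind of information a reader cannot recover from the point estimates alone.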

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below, indicating revisions where the manuscript will be updated.

point-by-point responses
  1. Referee: [Abstract (argument paragraph)] The claim that the negative correlations on comfort/strangeness reflect LLMs' lack of 'access to the internal affective states' (rather than questionnaire wording, prompt construction, or training-data artifacts) is invoked to justify the recommendation for supplementary modalities. No ablation is reported that isolates this factor (e.g., prompt variants supplying simulated internal-state signals, rephrased strangeness items, or non-LLM baselines also lacking embodiment). This interpretive step is load-bearing for the central boundary-condition argument.

    Authors: We agree that the manuscript presents no ablation studies to isolate the source of the inversion. The observed pattern is consistent across five LLMs with differing architectures and the live-robot replication, which reduces the likelihood that it arises solely from a single prompt template or wording choice. Nevertheless, this does not constitute a definitive causal demonstration. In revision we will (1) temper the causal phrasing in the abstract and discussion, (2) explicitly list the absence of ablation experiments as a limitation, and (3) frame the call for supplementary modalities as motivated by the empirical failure rather than as proven causation. No new experiments will be added at this stage. revision: partial

  2. Referee: [Study 2] The live-robot replication is conducted with N=4, yet the abstract supplies no error bars, exclusion criteria, or full statistical tables for these trials. Given that the inversion pattern is asserted to persist 'across ... embodied deployment,' the small sample size and limited reporting undermine the strength of the embodied evidence relative to the larger Study 1 dataset.

    Authors: We accept that the reporting for Study 2 is inadequate. The N=4 comprises four live interactions (two per participant) with no exclusions applied. In the revised manuscript we will expand the Study 2 section to include (1) error bars on all reported correlations, (2) complete statistical tables, (3) a clearer description of the procedure and participant count, and (4) explicit qualification of the replication as preliminary. The abstract will be updated to reflect these limitations while retaining the claim that the pattern was observed in embodied deployment. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical reporting of observed correlations

full rationale

The manuscript contains no derivation chain, equations, fitted parameters, or first-principles predictions. All reported results are direct empirical computations (Pearson r, ICC) on questionnaire responses from LLMs and human ground truth. The central pattern (negative correlations on strangeness) is presented as an observed outcome, not derived from any self-referential definition or prior self-citation. The interpretive claim about missing affective states is an untested post-hoc argument in the abstract and discussion; it does not reduce any quantitative result to its own inputs by construction. No self-citation load-bearing steps, ansatzes, or renamings of known results appear in the reported analyses.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work is empirical and relies on standard statistical assumptions for correlation and ICC calculations plus the validity of existing HRI questionnaires; no new free parameters, axioms, or invented entities are introduced.

axioms (1)
  • [standard math] Pearson correlation and ICC are appropriate for measuring agreement between LLM and human ratings on ordinal questionnaire scales
    Invoked implicitly when reporting r and ICC values in Study 1
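
For readers unfamiliar with the ICC, a minimal sketch of one common form follows: ICC(2,1), the two-way random-effects, absolute-agreement, single-measure coefficient. Which ICC form the paper uses is not stated here, so this choice is an assumption, and the ratings are hypothetical.

```python
def icc_2_1(ratings):
    """ICC(2,1). Rows are rated targets (interactions); columns are repeated runs."""
    n = len(ratings)
    k = len(ratings[0]) if n else 0
    if n < 2 or k < 2 or any(len(row) != k for row in ratings):
        raise ValueError("need a rectangular table with >= 2 rows and >= 2 columns")
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((v - grand) ** 2 for row in ratings for v in row)
    ss_err = ss_total - ss_rows - ss_cols
    msr = ss_rows / (n - 1)               # between-target mean square
    msc = ss_cols / (k - 1)               # between-run mean square
    mse = ss_err / ((n - 1) * (k - 1))    # residual mean square
    denom = msr + (k - 1) * mse + k * (msc - mse) / n
    if denom == 0:
        raise ValueError("ICC is undefined when all ratings are identical")
    return (msr - mse) / denom

# Hypothetical ratings: five interactions, each rated in three repeated LLM runs.
runs = [[4, 4, 5], [2, 2, 2], [5, 4, 5], [1, 2, 1], [3, 3, 3]]
icc = icc_2_1(runs)
print(f"ICC(2,1) = {icc:.2f}")
```

The coefficient approaches 1 when repeated runs give nearly the same rating for each interaction relative to the spread between interactions. It treats the 1-5 scale as interval data, which is exactly the assumption the ledger entry above flags.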

pith-pipeline@v0.9.1-grok · 5824 in / 1099 out tokens · 22896 ms · 2026-06-26T08:08:42.019539+00:00 · methodology
