When Robots Rate Their Own Interactions: Engagement Validity and the Strangeness Failure

Hasan Mahmud; Jamison Heard; Mohammad Javad Khojasteh; Prabu David; Victor Lockwood

arxiv: 2606.23339 · v1 · pith:3GUIPCOAnew · submitted 2026-06-22 · 💻 cs.RO

When Robots Rate Their Own Interactions: Engagement Validity and the Strangeness Failure

Victor Lockwood , Hasan Mahmud , Mohammad Javad Khojasteh , Prabu David , Jamison Heard This is my paper

Pith reviewed 2026-06-26 08:08 UTC · model grok-4.3

classification 💻 cs.RO

keywords human-robot interactionLLM evaluationinverted evaluationengagement assessmentstrangenessHRI questionnaires

0 comments

The pith

LLM-powered robots agree with humans on engagement ratings but systematically invert strangeness assessments.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests an inverted evaluation approach in which LLM-powered robots complete the same HRI questionnaires that humans use, but from the robot's own perspective. Ratings from five LLMs across 25 interactions matched human ground truth on satisfaction and enjoyment dimensions yet reversed the comfort and strangeness scores. The same inversion appeared when a physical Nao robot performed live, turn-by-turn assessments. The authors conclude that LLMs lack the internal affective states needed to judge strangeness and that supplementary sensor data will be required for reliable robot self-evaluation.

Core claim

LLMs achieve moderate-to-strong agreement with humans on engagement (satisfaction r up to .65, enjoyment r up to .72) with high test-retest reliability, but invert the comfort/strangeness dimension (r = -.44 to -.67). The pattern holds across models, synthetic controls, and embodied deployment.

What carries the argument

Inverted evaluation, in which the LLM completes standardized questionnaires (HRI-CUES, Godspeed, RoSAS) from the robot's perspective for direct comparison to human ratings.

Load-bearing premise

The inversion on strangeness occurs because LLMs lack access to internal affective states rather than because of questionnaire wording or prompt artifacts.

What would settle it

Feeding the LLM physiological signals, gaze, or proxemics data during assessment and checking whether the strangeness correlation shifts from negative to positive would test the claim.

Figures

Figures reproduced from arXiv: 2606.23339 by Hasan Mahmud, Jamison Heard, Mohammad Javad Khojasteh, Prabu David, Victor Lockwood.

**Figure 1.** Figure 1: Self-Report Agreement: Pearson r between LLM ratings and human ground truth across five models and five HRI-CUES dimensions. Engagement dimensions and comfort reached significance (p < .05); quality did not. TABLE III PHASE 2: DESCRIPTIVE STATISTICS, RELIABILITY, AND CROSS-MODEL AGREEMENT FOR GODSPEED AND ROSAS (BASELINE CONDITION). rCROSS = PEARSON r BETWEEN CLAUDE SONNET AND GPT-4O ON PARTICIPANT-LEVEL S… view at source ↗

**Figure 2.** Figure 2: Turn-level divergence (Robot − Human) on HRI-CUES items. Values near zero indicate agreement. Strangeness (pink) shows persistent negative divergence for P1 and P2. P4 used audio-only input (no camera). layer is analogous to a soul that can exist without a body, and whether the robot’s memories persist across sessions. P1 rated strangeness at 4 on every turn and 5 post-hoc, while the Robot rated 1–2 throug… view at source ↗

read the original abstract

Human-robot interaction (HRI) evaluation relies almost exclusively on human-completed questionnaires, leaving the robot's perspective unexamined. We propose an \textit{inverted evaluation}, in which LLM-powered robots complete the same standardized instruments from their own perspective, and test whether these ratings agree with human ground truth. In Study~1, five LLMs completed HRI-CUES, Godspeed, and RoSAS questionnaires for 25~interactions ($N = 1{,}522$ evaluations) from the HRI-CUES dataset. LLMs achieved moderate-to-strong agreement on engagement dimensions (satisfaction $r$ up to $.65$ and enjoyment $r$ up to $.72$) with excellent test-retest reliability (ICC $\geq .82$), but \textit{systematically inverted} the comfort/strangeness dimension ($r = -.44$ to $-.67$, all $p < .05$), conflating engagement with comfort. In Study~2, a Nao robot running Claude~Sonnet~4.5 replicated these patterns in live interactions ($N = 4$), including real-time turn-by-turn assessment. The strangeness failure persisted across five models, synthetic controls, and embodied deployment for two participants. We argue that current LLM-based robots lack access to the internal affective states needed to assess constructs like strangeness, and that inverted evaluation requires supplementary modalities (e.g., physiological signals, gaze, proxemics) to move beyond behavioral proxies. These findings establish boundary conditions for using LLMs as interaction evaluators in HRI.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

LLMs match humans on engagement ratings but invert strangeness, with the cause left as an untested interpretation.

read the letter

The main point is that this paper tests LLMs rating HRI interactions from the robot's side and finds they line up with human ratings on engagement measures but flip the strangeness and comfort scores. The pattern holds across five models on the HRI-CUES dataset and shows up again in a live Nao robot trial.

The inverted evaluation framing is the actual new piece. They run the same questionnaires (HRI-CUES, Godspeed, RoSAS) on 25 interactions and report moderate positive correlations on satisfaction and enjoyment alongside consistent negative correlations on strangeness. Test-retest reliability comes out high. Adding the embodied Nao run with real-time ratings is a reasonable next step even if the sample stays small.

The dataset results look solid enough on their own terms, with p-values and ICCs provided. The live trial at least checks whether the pattern survives embodiment.

The soft spot is the explanation. The paper attributes the strangeness inversion to LLMs lacking internal affective states, but the data do not test that claim. No prompt ablations, no reworded items, and no non-LLM baselines appear to rule out wording effects or training artifacts. The N=4 live sample also limits how much weight the embodied replication can carry.

This is for HRI researchers who use questionnaires to evaluate robot interactions and are considering LLM-based tools. A reader focused on evaluation methods would get a clear empirical warning about one dimension. The work engages honestly with the existing instruments and reports the numbers plainly.

It deserves peer review. The core observation is reproducible across models and worth referee scrutiny even if the causal step needs more support.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes an 'inverted evaluation' method in which LLM-powered robots complete standard HRI questionnaires (HRI-CUES, Godspeed, RoSAS) from the robot's perspective and compares these ratings to human ground truth. Study 1 has five LLMs rate 25 interactions from the HRI-CUES dataset (1,522 total evaluations), finding moderate-to-strong agreement on engagement dimensions (satisfaction r up to .65, enjoyment r up to .72, ICC ≥ .82) but systematic inversion on comfort/strangeness (r = -.44 to -.67, all p < .05). Study 2 replicates the pattern in live Nao robot interactions (N=4) using Claude Sonnet 4.5 with real-time assessment. The authors conclude that LLMs lack access to internal affective states for constructs like strangeness and therefore require supplementary modalities (physiological signals, gaze, proxemics) beyond behavioral proxies.

Significance. If the inversion pattern is robust, the work supplies concrete boundary conditions for deploying LLMs as interaction evaluators in HRI. The consistency across five models plus the embodied live-robot replication constitutes a strength of the empirical design. The findings could usefully caution against sole reliance on LLM self-ratings for affective dimensions and motivate multimodal extensions, provided the causal attribution is clarified.

major comments (2)

[Abstract (argument paragraph)] Abstract (argument paragraph): The claim that the negative correlations on comfort/strangeness reflect LLMs' lack of 'access to the internal affective states' (rather than questionnaire wording, prompt construction, or training-data artifacts) is invoked to justify the recommendation for supplementary modalities. No ablation is reported that isolates this factor (e.g., prompt variants supplying simulated internal-state signals, rephrased strangeness items, or non-LLM baselines also lacking embodiment). This interpretive step is load-bearing for the central boundary-condition argument.
[Study 2] Study 2: The live-robot replication is conducted with N=4, yet the abstract supplies no error bars, exclusion criteria, or full statistical tables for these trials. Given that the inversion pattern is asserted to persist 'across ... embodied deployment,' the small sample size and limited reporting undermine the strength of the embodied evidence relative to the larger Study 1 dataset.

minor comments (2)

[Abstract] Abstract: The statement that the strangeness failure 'persisted across five models, synthetic controls, and embodied deployment for two participants' does not define the synthetic controls or clarify why only two of the four participants are referenced for the embodied case.
[Abstract] Abstract: Correlation values are reported without accompanying sample sizes per dimension or confidence intervals, reducing transparency even though overall N=1,522 is stated.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below, indicating revisions where the manuscript will be updated.

read point-by-point responses

Referee: [Abstract (argument paragraph)] Abstract (argument paragraph): The claim that the negative correlations on comfort/strangeness reflect LLMs' lack of 'access to the internal affective states' (rather than questionnaire wording, prompt construction, or training-data artifacts) is invoked to justify the recommendation for supplementary modalities. No ablation is reported that isolates this factor (e.g., prompt variants supplying simulated internal-state signals, rephrased strangeness items, or non-LLM baselines also lacking embodiment). This interpretive step is load-bearing for the central boundary-condition argument.

Authors: We agree that the manuscript presents no ablation studies to isolate the source of the inversion. The observed pattern is consistent across five LLMs with differing architectures and the live-robot replication, which reduces the likelihood that it arises solely from a single prompt template or wording choice. Nevertheless, this does not constitute a definitive causal demonstration. In revision we will (1) temper the causal phrasing in the abstract and discussion, (2) explicitly list the absence of ablation experiments as a limitation, and (3) frame the call for supplementary modalities as motivated by the empirical failure rather than as proven causation. No new experiments will be added at this stage. revision: partial
Referee: [Study 2] Study 2: The live-robot replication is conducted with N=4, yet the abstract supplies no error bars, exclusion criteria, or full statistical tables for these trials. Given that the inversion pattern is asserted to persist 'across ... embodied deployment,' the small sample size and limited reporting undermine the strength of the embodied evidence relative to the larger Study 1 dataset.

Authors: We accept that the reporting for Study 2 is inadequate. The N=4 comprises four live interactions (two per participant) with no exclusions applied. In the revised manuscript we will expand the Study 2 section to include (1) error bars on all reported correlations, (2) complete statistical tables, (3) a clearer description of the procedure and participant count, and (4) explicit qualification of the replication as preliminary. The abstract will be updated to reflect these limitations while retaining the claim that the pattern was observed in embodied deployment. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical reporting of observed correlations

full rationale

The manuscript contains no derivation chain, equations, fitted parameters, or first-principles predictions. All reported results are direct empirical computations (Pearson r, ICC) on questionnaire responses from LLMs and human ground truth. The central pattern (negative correlations on strangeness) is presented as an observed outcome, not derived from any self-referential definition or prior self-citation. The interpretive claim about missing affective states is an untested post-hoc argument in the abstract and discussion; it does not reduce any quantitative result to its own inputs by construction. No self-citation load-bearing steps, ansatzes, or renamings of known results appear in the reported analyses.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The work is empirical and relies on standard statistical assumptions for correlation and ICC calculations plus the validity of existing HRI questionnaires; no new free parameters, axioms, or invented entities are introduced.

axioms (1)

standard math Pearson correlation and ICC are appropriate for measuring agreement between LLM and human ratings on ordinal questionnaire scales
Invoked implicitly when reporting r and ICC values in Study 1

pith-pipeline@v0.9.1-grok · 5824 in / 1099 out tokens · 22896 ms · 2026-06-26T08:08:42.019539+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

35 extracted references · 6 canonical work pages · 2 internal anchors

[1]

Measurement in- struments for the anthropomorphism, animacy, likeability, perceived intelligence, and perceived safety of robots,

C. Bartneck, D. Kuli ´c, E. Croft, and S. Zoghbi, “Measurement in- struments for the anthropomorphism, animacy, likeability, perceived intelligence, and perceived safety of robots,”International Journal of Social Robotics, vol. 1, no. 1, pp. 71–81, 2009

2009
[2]

The robotic social attributes scale (RoSAS): Development and validation,

C. M. Carpinella, A. B. Wyman, M. A. Perez, and S. J. Stroessner, “The robotic social attributes scale (RoSAS): Development and validation,” in Proc. ACM/IEEE Int. Conf. Human-Robot Interaction (HRI), pp. 254– 262, 2017

2017
[3]

HRI CUES dataset (anonymized)

B. Irfanet al., “HRI CUES dataset (anonymized).” Zenodo, 2024. CC- BY 4.0

2024
[4]

Building knowledge from interactions: An LLM-based architecture for adaptive tutoring and social reasoning,

L. Garello, G. Belgiovine, G. Russo, F. Rea, and A. Sciutti, “Building knowledge from interactions: An LLM-based architecture for adaptive tutoring and social reasoning,” inProc. IEEE/RSJ Int. Conf. Intelligent Robots and Systems (IROS), 2025

2025
[5]

Understanding Large-Language Model (LLM)-powered human-robot interaction,

C. Y . Kim, C. P. Lee, and B. Mutlu, “Understanding Large-Language Model (LLM)-powered human-robot interaction,” inProc. ACM/IEEE Int. Conf. Human-Robot Interaction (HRI), pp. 371–380, 2024

2024
[6]

What people share with a robot when feeling lonely and stressed and how it helps over time,

G. Laban, S. Chiang, and H. Gunes, “What people share with a robot when feeling lonely and stressed and how it helps over time,” in2025 34th IEEE International Conference on Robot and Human Interactive Communication (RO-MAN), pp. 1930–1935, IEEE, 2025

1930
[7]

VLM-Social-Nav: Socially aware robot navigation through scoring using vision-language models,

D. Song, J. Liang, A. Payandeh, A. H. Raj, X. Xiao, and D. Manocha, “VLM-Social-Nav: Socially aware robot navigation through scoring using vision-language models,”IEEE Robotics and Automation Letters, 2024

2024
[8]

Out of one, many: Using language models to simulate human samples,

L. P. Argyle, E. C. Busby, N. Fulda, J. R. Gubler, C. Rytting, and D. Wingate, “Out of one, many: Using language models to simulate human samples,”Political Analysis, vol. 31, no. 3, pp. 337–351, 2023

2023
[9]

HumanStudy-Bench: Towards AI agent design for participant simu- lation,

X. Liu, H. Shang, Z. Liu, X. Liu, Y . Xiao, Y . Tu, and H. Jin, “HumanStudy-Bench: Towards AI agent design for participant simu- lation,”arXiv preprint arXiv:2602.00685, 2026

work page arXiv 2026
[10]

SimBench: Benchmarking the Ability of Large Language Models to Simulate Human Behaviors

T. Hu, J. Baumann, L. Lupo, D. Hovy, N. Collier, and P. R ¨ottger, “SimBench: Benchmarking the ability of large language models to simulate human behaviors,”arXiv preprint arXiv:2510.17516, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[11]

Towards understanding sycophancy in language models,

M. Sharma, M. Tong, T. Korbak, D. Duvenaud, A. Askell, S. Bowman, et al., “Towards understanding sycophancy in language models,” inPro- ceedings of the International Conference on Learning Representations (ICLR), 2024

2024
[12]

Self-assessment tests are unreliable measures of llm personality,

A. Gupta, X. Song, and G. Anumanchipalli, “Self-assessment tests are unreliable measures of llm personality,” inProceedings of the 7th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP, pp. 301–314, 2024

2024
[13]

Do LLMs have distinct and consistent personality? TRAIT: Personality testset designed for LLMs with psychometrics,

S. Lee, S. Lim, S. Han, G. Oh, H. Chae, J. Chung, M. Kim, B.-w. Kwak, et al., “Do LLMs have distinct and consistent personality? TRAIT: Personality testset designed for LLMs with psychometrics,” inProc. Conf. Empirical Methods in Natural Language Processing (EMNLP), 2024

2024
[14]

Are large language models aligned with people’s social intuitions for human- robot interactions?,

L. Wachowiak, A. Coles, O. Celiktutan, and G. Canal, “Are large language models aligned with people’s social intuitions for human- robot interactions?,” inProc. IEEE/RSJ Int. Conf. Intelligent Robots and Systems (IROS), 2024

2024
[15]

On the reliability of psychological scales on large language models,

J.-t. Huang, W. Jiao, M. H. Lam, E. J. Li, W. Wang, and M. Lyu, “On the reliability of psychological scales on large language models,” inProc. Conf. Empirical Methods in Natural Language Processing (EMNLP), pp. 6152–6173, 2024

2024
[16]

Large language models fail on trivial alterations to theory- of-mind tasks,

T. Ullman, “Large language models fail on trivial alterations to theory- of-mind tasks,”arXiv preprint arXiv:2302.08399, 2023

work page arXiv 2023
[17]

Social robots are like real people: First impressions, attributes, and stereotyping of social robots,

B. Reeves, J. Hancock, and X. Liu, “Social robots are like real people: First impressions, attributes, and stereotyping of social robots,” Technology, Mind, and Behavior, vol. 1, no. 1, p. 76, 2020

2020
[18]

The social perception of humanoid and non-humanoid robots: Effects of gendered and machinelike features,

S. J. Stroessner and J. Benitez, “The social perception of humanoid and non-humanoid robots: Effects of gendered and machinelike features,” International Journal of Social Robotics, vol. 11, no. 2, pp. 305–315, 2019

2019
[19]

Toward grounded commonsense reasoning,

M. Kwon, H. Hu, V . Myers, S. Karamcheti, A. Dragan, and D. Sadigh, “Toward grounded commonsense reasoning,” inProc. IEEE Int. Conf. Robotics and Automation (ICRA), pp. 5463–5470, 2024

2024
[20]

Chat with the environment: Interactive multimodal perception using large language models,

X. Zhao, M. Li, C. Weber, M. B. Hafez, and S. Wermter, “Chat with the environment: Interactive multimodal perception using large language models,” inProc. IEEE/RSJ Int. Conf. Intelligent Robots and Systems (IROS), pp. 3590–3596, 2023

2023
[21]

Between reality and delusion: Challenges of applying large language models to companion robots for open-domain dialogues with older adults,

B. Irfan, S. Kuoppam ¨aki, A. Hosseini, and G. Skantze, “Between reality and delusion: Challenges of applying large language models to companion robots for open-domain dialogues with older adults,” Autonomous Robots, vol. 49, no. 1, p. 9, 2025

2025
[22]

Infusing Theory of Mind into Socially Intelligent LLM Agents

E. Hwanget al., “Infusing theory of mind into socially intelligent LLM agents,”arXiv preprint arXiv:2509.22887, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[23]

SOTOPIA: Interactive evaluation for social intelligence in language agents,

X. Zhou, H. Zhu, L. Mathur, R. Zhang, H. Yu, Z. Qi, L.-P. Morency, Y . Bisk, D. Fried, G. Neubig,et al., “SOTOPIA: Interactive evaluation for social intelligence in language agents,” inProc. Int. Conf. Learning Representations (ICLR), 2024

2024
[24]

Personallm: Investigating the ability of large language models to ex- press personality traits,

H. Jiang, X. Zhang, X. Cao, C. Breazeal, D. Roy, and J. Kabbara, “Personallm: Investigating the ability of large language models to ex- press personality traits,” inFindings of the association for computational linguistics: NAACL 2024, pp. 3605–3627, 2024

2024
[25]

Can LLM “self-report

Y . Zhuet al., “Can LLM “self-report”? exploring the validity of LLM-based self-rating in conversational agents,”arXiv preprint arXiv:2412.00207, 2024

work page arXiv 2024
[26]

Human-robot interaction conversational user enjoyment scale (HRI CUES),

B. Irfan, J. Miniota, S. Thunberg, E. Lagerstedt, S. Kuoppam ¨aki, G. Skantze, and A. Pereira, “Human-robot interaction conversational user enjoyment scale (HRI CUES),”IEEE Transactions on Affective Computing, 2025

2025
[27]

Reporting guidelines for large language models in human–robot interaction,

C. Matuszek, T. Williams, N. DePalma, R. Mead, R. Wen, E. Schneiders, C. Kennington, and A. Bezabih, “Reporting guidelines for large language models in human–robot interaction,”J. Hum.-Robot Interact., vol. 15, Jan. 2026

2026
[28]

Scientists rise up against statistical significance,

V . Amrhein, S. Greenland, and B. McShane, “Scientists rise up against statistical significance,”Nature, vol. 567, no. 7748, pp. 305–307, 2019

2019
[29]

10 years of human-NAO interaction research: A scoping review,

A. Amirova, N. Rakhymbayeva, E. Yadollahi, A. Sandygulova, and W. Johal, “10 years of human-NAO interaction research: A scoping review,”Frontiers in Robotics and AI, vol. 8, p. 744526, 2021

2021
[30]

Robust speech recognition via large-scale weak supervi- sion,

A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, “Robust speech recognition via large-scale weak supervi- sion,” inProceedings of the 40th International Conference on Machine Learning (ICML), pp. 28492–28518, 2023

2023
[31]

Academically intelligent LLMs are not necessarily socially intelligent,

R. Xuet al., “Academically intelligent LLMs are not necessarily socially intelligent,”arXiv preprint arXiv:2403.06591, 2024

work page arXiv 2024
[32]

Affective state estimation for human-robot interaction,

D. Kuli ´c and E. A. Croft, “Affective state estimation for human-robot interaction,”IEEE Transactions on Robotics, vol. 23, no. 5, pp. 991– 1000, 2007

2007
[33]

Addressing data scarcity in multimodal user state recognition by combining semi-supervised and supervised learning,

H. V oß, H. Wersing, and S. Kopp, “Addressing data scarcity in multimodal user state recognition by combining semi-supervised and supervised learning,” inCompanion Publication of the Int. Conf. on Multimodal Interaction (ICMI), 2021

2021
[34]

V ADER: A parsimonious rule-based model for sentiment analysis of social media text,

C. Hutto and E. Gilbert, “V ADER: A parsimonious rule-based model for sentiment analysis of social media text,” inProceedings of the International AAAI Conference on Web and Social Media, vol. 8, pp. 216–225, 2014

2014
[35]

Regulating modal- ity utilization within multimodal fusion networks,

S. Singh, E. Saber, P. P. Markopoulos, and J. Heard, “Regulating modal- ity utilization within multimodal fusion networks,”Sensors, vol. 24, no. 18, p. 6054, 2024

2024

[1] [1]

Measurement in- struments for the anthropomorphism, animacy, likeability, perceived intelligence, and perceived safety of robots,

C. Bartneck, D. Kuli ´c, E. Croft, and S. Zoghbi, “Measurement in- struments for the anthropomorphism, animacy, likeability, perceived intelligence, and perceived safety of robots,”International Journal of Social Robotics, vol. 1, no. 1, pp. 71–81, 2009

2009

[2] [2]

The robotic social attributes scale (RoSAS): Development and validation,

C. M. Carpinella, A. B. Wyman, M. A. Perez, and S. J. Stroessner, “The robotic social attributes scale (RoSAS): Development and validation,” in Proc. ACM/IEEE Int. Conf. Human-Robot Interaction (HRI), pp. 254– 262, 2017

2017

[3] [3]

HRI CUES dataset (anonymized)

B. Irfanet al., “HRI CUES dataset (anonymized).” Zenodo, 2024. CC- BY 4.0

2024

[4] [4]

Building knowledge from interactions: An LLM-based architecture for adaptive tutoring and social reasoning,

L. Garello, G. Belgiovine, G. Russo, F. Rea, and A. Sciutti, “Building knowledge from interactions: An LLM-based architecture for adaptive tutoring and social reasoning,” inProc. IEEE/RSJ Int. Conf. Intelligent Robots and Systems (IROS), 2025

2025

[5] [5]

Understanding Large-Language Model (LLM)-powered human-robot interaction,

C. Y . Kim, C. P. Lee, and B. Mutlu, “Understanding Large-Language Model (LLM)-powered human-robot interaction,” inProc. ACM/IEEE Int. Conf. Human-Robot Interaction (HRI), pp. 371–380, 2024

2024

[6] [6]

What people share with a robot when feeling lonely and stressed and how it helps over time,

G. Laban, S. Chiang, and H. Gunes, “What people share with a robot when feeling lonely and stressed and how it helps over time,” in2025 34th IEEE International Conference on Robot and Human Interactive Communication (RO-MAN), pp. 1930–1935, IEEE, 2025

1930

[7] [7]

VLM-Social-Nav: Socially aware robot navigation through scoring using vision-language models,

D. Song, J. Liang, A. Payandeh, A. H. Raj, X. Xiao, and D. Manocha, “VLM-Social-Nav: Socially aware robot navigation through scoring using vision-language models,”IEEE Robotics and Automation Letters, 2024

2024

[8] [8]

Out of one, many: Using language models to simulate human samples,

L. P. Argyle, E. C. Busby, N. Fulda, J. R. Gubler, C. Rytting, and D. Wingate, “Out of one, many: Using language models to simulate human samples,”Political Analysis, vol. 31, no. 3, pp. 337–351, 2023

2023

[9] [9]

HumanStudy-Bench: Towards AI agent design for participant simu- lation,

X. Liu, H. Shang, Z. Liu, X. Liu, Y . Xiao, Y . Tu, and H. Jin, “HumanStudy-Bench: Towards AI agent design for participant simu- lation,”arXiv preprint arXiv:2602.00685, 2026

work page arXiv 2026

[10] [10]

SimBench: Benchmarking the Ability of Large Language Models to Simulate Human Behaviors

T. Hu, J. Baumann, L. Lupo, D. Hovy, N. Collier, and P. R ¨ottger, “SimBench: Benchmarking the ability of large language models to simulate human behaviors,”arXiv preprint arXiv:2510.17516, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[11] [11]

Towards understanding sycophancy in language models,

M. Sharma, M. Tong, T. Korbak, D. Duvenaud, A. Askell, S. Bowman, et al., “Towards understanding sycophancy in language models,” inPro- ceedings of the International Conference on Learning Representations (ICLR), 2024

2024

[12] [12]

Self-assessment tests are unreliable measures of llm personality,

A. Gupta, X. Song, and G. Anumanchipalli, “Self-assessment tests are unreliable measures of llm personality,” inProceedings of the 7th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP, pp. 301–314, 2024

2024

[13] [13]

Do LLMs have distinct and consistent personality? TRAIT: Personality testset designed for LLMs with psychometrics,

S. Lee, S. Lim, S. Han, G. Oh, H. Chae, J. Chung, M. Kim, B.-w. Kwak, et al., “Do LLMs have distinct and consistent personality? TRAIT: Personality testset designed for LLMs with psychometrics,” inProc. Conf. Empirical Methods in Natural Language Processing (EMNLP), 2024

2024

[14] [14]

Are large language models aligned with people’s social intuitions for human- robot interactions?,

L. Wachowiak, A. Coles, O. Celiktutan, and G. Canal, “Are large language models aligned with people’s social intuitions for human- robot interactions?,” inProc. IEEE/RSJ Int. Conf. Intelligent Robots and Systems (IROS), 2024

2024

[15] [15]

On the reliability of psychological scales on large language models,

J.-t. Huang, W. Jiao, M. H. Lam, E. J. Li, W. Wang, and M. Lyu, “On the reliability of psychological scales on large language models,” inProc. Conf. Empirical Methods in Natural Language Processing (EMNLP), pp. 6152–6173, 2024

2024

[16] [16]

Large language models fail on trivial alterations to theory- of-mind tasks,

T. Ullman, “Large language models fail on trivial alterations to theory- of-mind tasks,”arXiv preprint arXiv:2302.08399, 2023

work page arXiv 2023

[17] [17]

Social robots are like real people: First impressions, attributes, and stereotyping of social robots,

B. Reeves, J. Hancock, and X. Liu, “Social robots are like real people: First impressions, attributes, and stereotyping of social robots,” Technology, Mind, and Behavior, vol. 1, no. 1, p. 76, 2020

2020

[18] [18]

The social perception of humanoid and non-humanoid robots: Effects of gendered and machinelike features,

S. J. Stroessner and J. Benitez, “The social perception of humanoid and non-humanoid robots: Effects of gendered and machinelike features,” International Journal of Social Robotics, vol. 11, no. 2, pp. 305–315, 2019

2019

[19] [19]

Toward grounded commonsense reasoning,

M. Kwon, H. Hu, V . Myers, S. Karamcheti, A. Dragan, and D. Sadigh, “Toward grounded commonsense reasoning,” inProc. IEEE Int. Conf. Robotics and Automation (ICRA), pp. 5463–5470, 2024

2024

[20] [20]

Chat with the environment: Interactive multimodal perception using large language models,

X. Zhao, M. Li, C. Weber, M. B. Hafez, and S. Wermter, “Chat with the environment: Interactive multimodal perception using large language models,” inProc. IEEE/RSJ Int. Conf. Intelligent Robots and Systems (IROS), pp. 3590–3596, 2023

2023

[21] [21]

Between reality and delusion: Challenges of applying large language models to companion robots for open-domain dialogues with older adults,

B. Irfan, S. Kuoppam ¨aki, A. Hosseini, and G. Skantze, “Between reality and delusion: Challenges of applying large language models to companion robots for open-domain dialogues with older adults,” Autonomous Robots, vol. 49, no. 1, p. 9, 2025

2025

[22] [22]

Infusing Theory of Mind into Socially Intelligent LLM Agents

E. Hwanget al., “Infusing theory of mind into socially intelligent LLM agents,”arXiv preprint arXiv:2509.22887, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[23] [23]

SOTOPIA: Interactive evaluation for social intelligence in language agents,

X. Zhou, H. Zhu, L. Mathur, R. Zhang, H. Yu, Z. Qi, L.-P. Morency, Y . Bisk, D. Fried, G. Neubig,et al., “SOTOPIA: Interactive evaluation for social intelligence in language agents,” inProc. Int. Conf. Learning Representations (ICLR), 2024

2024

[24] [24]

Personallm: Investigating the ability of large language models to ex- press personality traits,

H. Jiang, X. Zhang, X. Cao, C. Breazeal, D. Roy, and J. Kabbara, “Personallm: Investigating the ability of large language models to ex- press personality traits,” inFindings of the association for computational linguistics: NAACL 2024, pp. 3605–3627, 2024

2024

[25] [25]

Can LLM “self-report

Y . Zhuet al., “Can LLM “self-report”? exploring the validity of LLM-based self-rating in conversational agents,”arXiv preprint arXiv:2412.00207, 2024

work page arXiv 2024

[26] [26]

Human-robot interaction conversational user enjoyment scale (HRI CUES),

B. Irfan, J. Miniota, S. Thunberg, E. Lagerstedt, S. Kuoppam ¨aki, G. Skantze, and A. Pereira, “Human-robot interaction conversational user enjoyment scale (HRI CUES),”IEEE Transactions on Affective Computing, 2025

2025

[27] [27]

Reporting guidelines for large language models in human–robot interaction,

C. Matuszek, T. Williams, N. DePalma, R. Mead, R. Wen, E. Schneiders, C. Kennington, and A. Bezabih, “Reporting guidelines for large language models in human–robot interaction,”J. Hum.-Robot Interact., vol. 15, Jan. 2026

2026

[28] [28]

Scientists rise up against statistical significance,

V . Amrhein, S. Greenland, and B. McShane, “Scientists rise up against statistical significance,”Nature, vol. 567, no. 7748, pp. 305–307, 2019

2019

[29] [29]

10 years of human-NAO interaction research: A scoping review,

A. Amirova, N. Rakhymbayeva, E. Yadollahi, A. Sandygulova, and W. Johal, “10 years of human-NAO interaction research: A scoping review,”Frontiers in Robotics and AI, vol. 8, p. 744526, 2021

2021

[30] [30]

Robust speech recognition via large-scale weak supervi- sion,

A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, “Robust speech recognition via large-scale weak supervi- sion,” inProceedings of the 40th International Conference on Machine Learning (ICML), pp. 28492–28518, 2023

2023

[31] [31]

Academically intelligent LLMs are not necessarily socially intelligent,

R. Xuet al., “Academically intelligent LLMs are not necessarily socially intelligent,”arXiv preprint arXiv:2403.06591, 2024

work page arXiv 2024

[32] [32]

Affective state estimation for human-robot interaction,

D. Kuli ´c and E. A. Croft, “Affective state estimation for human-robot interaction,”IEEE Transactions on Robotics, vol. 23, no. 5, pp. 991– 1000, 2007

2007

[33] [33]

Addressing data scarcity in multimodal user state recognition by combining semi-supervised and supervised learning,

H. V oß, H. Wersing, and S. Kopp, “Addressing data scarcity in multimodal user state recognition by combining semi-supervised and supervised learning,” inCompanion Publication of the Int. Conf. on Multimodal Interaction (ICMI), 2021

2021

[34] [34]

V ADER: A parsimonious rule-based model for sentiment analysis of social media text,

C. Hutto and E. Gilbert, “V ADER: A parsimonious rule-based model for sentiment analysis of social media text,” inProceedings of the International AAAI Conference on Web and Social Media, vol. 8, pp. 216–225, 2014

2014

[35] [35]

Regulating modal- ity utilization within multimodal fusion networks,

S. Singh, E. Saber, P. P. Markopoulos, and J. Heard, “Regulating modal- ity utilization within multimodal fusion networks,”Sensors, vol. 24, no. 18, p. 6054, 2024

2024