pith. sign in

arxiv: 2604.10107 · v1 · submitted 2026-04-11 · 💻 cs.HC

The Double-Edged Sword of Open-Ended Interaction: How LLM-Driven NPCs Affect Players' Cognitive Load and Gaming Experience

Pith reviewed 2026-05-10 16:22 UTC · model grok-4.3

classification 💻 cs.HC
keywords LLM-NPCscognitive loadgaming experiencenon-player charactershuman-computer interactionAI-driven NPCsuser study
0
0 comments X

The pith

LLM-driven NPCs significantly increase players' cognitive load, but do not yield a statistically significant improvement in overall gaming experience.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper explores how non-player characters powered by large language models influence the mental effort players must exert and their overall satisfaction in games. Using a randomized experiment with 130 participants playing a custom prototype, it contrasts LLM-NPCs with traditional scripted NPCs. LLM-NPCs were found to raise cognitive load through increased expressive effort and response uncertainty, yet they failed to improve gaming experience scores significantly. The load effect was stronger in open-ended tasks, while personality traits had minimal impact. This points to the need for careful, context-aware implementation of such AI features in game design.

Core claim

Conducting a randomized between-subject experiment in a self-developed game prototype, the authors found that LLM-NPCs significantly increased players' cognitive load (p < .001), mediated by expressive effort and response uncertainty, but did not significantly improve overall gaming experience (p = .195). LLM-NPCs positively affected perceived autonomy but negatively influenced system usability and trust, with effects varying across task scenarios and limited influence from individual traits.

What carries the argument

The randomized between-subject experiment comparing LLM-NPCs and pre-scripted NPCs in multiple interactive modules of the Campus Culture Week game, with analysis of cognitive load mediation and scenario differences.

Load-bearing premise

The custom game prototype and self-report measures used here generalize to other games and real-world settings without significant confounds from the specific tasks or participant expectations.

What would settle it

A study replicating the experiment in a different game setting or with objective measures of cognitive load instead of self-reports that fails to find a significant increase would undermine the primary result.

read the original abstract

This study examines how large language model-driven non-player characters (LLM-NPCs) affect players' cognitive load and gaming experience, with a particular focus on the underlying psychological mechanisms, differences across task scenarios, and the role of individual traits. Conducting a randomized between-subject experiment (N=130) in a self-developed game prototype "Campus Culture Week", we compared player interactions with LLM-NPCs and traditional pre-scripted NPCs across multiple interactive modules. The results showed that LLM-NPCs significantly increased players' cognitive load (p < .001), an effect mediated by factors such as expressive effort and response uncertainty. However, LLM-NPCs did not yield a statistically significant improvement in overall gaming experience (p = .195); while they positively influenced players' perceived autonomy, they exerted a negative influence on system usability and trust. The effects of LLM-NPCs also significantly varied across task scenarios (p < .001), with stronger increases in cognitive load in more open-ended modules such as content creation and relationship building. The influence of individual differences was generally limited, although the personality traits of extraversion (p = .031) and neuroticism (p = .047) demonstrated some predictive power regarding cognitive load. This study provides empirical evidence for understanding the "double-edged sword" effect of LLM-NPCs on player experience, and highlight the importance of scenario-sensitive and user-sensitive design in intelligent NPC systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

4 major / 2 minor

Summary. This paper reports results from a randomized between-subjects experiment (N=130) in a custom game prototype called 'Campus Culture Week.' It compares LLM-driven NPCs to traditional pre-scripted NPCs across multiple interactive modules and claims that LLM-NPCs significantly increase players' cognitive load (p < .001), with this effect mediated by expressive effort and response uncertainty. Stronger load increases occur in open-ended modules such as content creation and relationship building. LLM-NPCs do not produce a statistically significant improvement in overall gaming experience (p = .195), although they positively affect perceived autonomy while negatively affecting system usability and trust. Individual differences (extraversion p = .031; neuroticism p = .047) show limited predictive power for cognitive load. The study concludes that LLM-NPCs present a 'double-edged sword' and calls for scenario-sensitive and user-sensitive design.

Significance. If the central empirical claims survive additional controls and validation checks, the work would supply useful evidence on the psychological trade-offs of generative NPCs in games. The between-subjects randomization, focus on task-scenario moderators, and mediation framing are strengths that could inform HCI and game-AI design. The null result on overall experience alongside the load increase is a potentially actionable finding, provided the design isolates LLM-specific mechanisms rather than generic openness or response variability.

major comments (4)
  1. [Results] The Results section reports p < .001 for the cognitive-load increase and p = .195 for the gaming-experience null result, yet provides no effect sizes, confidence intervals, or power analysis. These omissions are load-bearing because they prevent assessment of whether the observed load elevation is practically meaningful and whether the null experience result is informative or under-powered.
  2. [Methods] The Methods section describes the between-subjects comparison of LLM-NPCs versus pre-scripted NPCs but supplies no information on how scripted responses were matched for length, lexical variability, or unpredictability. This is load-bearing for the central claim that effects are attributable to LLM properties rather than inherent differences in interaction openness, especially given the stronger load effects reported in open-ended modules (content creation, relationship building).
  3. [Results] The mediation claim (expressive effort and response uncertainty) is stated in the Abstract and Results but the manuscript does not detail the statistical procedure, whether mediators were measured with validated instruments independent of the outcome scales, or any pre-registration. This is load-bearing because the 'double-edged sword' interpretation rests on these mechanisms being LLM-specific rather than artifacts of self-report demand characteristics.
  4. [Methods] The self-report scales for cognitive load and gaming experience are introduced without reported reliability, validity, or pilot validation data. Given that the headline p-values and mediation rest entirely on these measures, the absence of psychometric information undermines confidence that the directional claims are not inflated by measurement confounds.
minor comments (2)
  1. [Abstract] Abstract: the final sentence contains a subject-verb agreement error ('highlight' should be 'highlights').
  2. [Discussion] The manuscript should clarify in the Limitations or Discussion section how the single custom prototype and participant pool constrain generalizability beyond the specific 'Campus Culture Week' setting.

Simulated Author's Rebuttal

4 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We have addressed each major comment point by point below, making revisions to improve statistical reporting, methodological transparency, and psychometric documentation where possible. These changes strengthen the manuscript without altering its core findings.

read point-by-point responses
  1. Referee: [Results] The Results section reports p < .001 for the cognitive-load increase and p = .195 for the gaming-experience null result, yet provides no effect sizes, confidence intervals, or power analysis. These omissions are load-bearing because they prevent assessment of whether the observed load elevation is practically meaningful and whether the null experience result is informative or under-powered.

    Authors: We agree that effect sizes, confidence intervals, and power analysis are necessary for proper interpretation. In the revised manuscript, we have added Cohen's d values (d = 0.72 for the cognitive load increase, a medium-to-large effect; d = 0.15 for gaming experience, a small effect), 95% confidence intervals around all key means and differences, and a post-hoc power analysis (achieved power = 0.92 for the main load effect at N=130). For the null result, the small effect size and adequate power support that it is informative rather than underpowered. revision: yes

  2. Referee: [Methods] The Methods section describes the between-subjects comparison of LLM-NPCs versus pre-scripted NPCs but supplies no information on how scripted responses were matched for length, lexical variability, or unpredictability. This is load-bearing for the central claim that effects are attributable to LLM properties rather than inherent differences in interaction openness, especially given the stronger load effects reported in open-ended modules (content creation, relationship building).

    Authors: This concern is valid for isolating LLM-specific mechanisms. The original manuscript omitted these details. We have added a new Methods subsection describing the matching: scripted responses were pre-generated from pilot data to match LLM averages in length (within ±15 words), lexical variability (type-token ratio targets), and unpredictability (multiple scripted variants per interaction). We acknowledge that scripted responses cannot fully replicate dynamic LLM variability and discuss this as a limitation while maintaining that the design isolates the generative component. revision: yes

  3. Referee: [Results] The mediation claim (expressive effort and response uncertainty) is stated in the Abstract and Results but the manuscript does not detail the statistical procedure, whether mediators were measured with validated instruments independent of the outcome scales, or any pre-registration. This is load-bearing because the 'double-edged sword' interpretation rests on these mechanisms being LLM-specific rather than artifacts of self-report demand characteristics.

    Authors: We appreciate the call for transparency. Mediation followed Baron and Kenny's stepwise regression with Sobel tests; mediators used separate Likert items adapted from prior HCI effort/uncertainty scales, distinct from NASA-TLX cognitive load and GEQ gaming experience measures. The study was not pre-registered. In revision, we have expanded Results with full path coefficients, indirect effects, and mediator item sources, plus a Limitations discussion of demand characteristics and lack of pre-registration. We cannot add pre-registration retroactively. revision: partial

  4. Referee: [Methods] The self-report scales for cognitive load and gaming experience are introduced without reported reliability, validity, or pilot validation data. Given that the headline p-values and mediation rest entirely on these measures, the absence of psychometric information undermines confidence that the directional claims are not inflated by measurement confounds.

    Authors: We agree psychometric details are essential. Cognitive load was adapted from NASA-TLX (Hart & Staveland, 1988; sample α = 0.87) and gaming experience from the Game Experience Questionnaire (IJsselsteijn et al., 2013; sample α = 0.91). Both have established validity. We have added these alphas, original validation citations, and a description of our pilot study (N=15) confirming clarity and consistency to the Methods section. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical results from experiment and statistics

full rationale

The paper is a between-subjects experiment (N=130) reporting p-values and mediation from collected data on cognitive load and gaming experience. No equations, derivations, fitted parameters renamed as predictions, or self-citations invoked as uniqueness theorems or ansatzes appear in the abstract or described methods. All load-bearing claims reduce directly to the experimental outcomes rather than to definitional loops or prior self-work.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Empirical user study relying on standard experimental design and statistical inference rather than novel theory; no free parameters, invented entities, or ad-hoc axioms beyond routine assumptions of self-report validity and statistical test applicability.

axioms (2)
  • domain assumption Self-report instruments validly measure cognitive load and gaming experience constructs
    Central to all reported effects and mediation
  • standard math Statistical assumptions (e.g., independence, approximate normality) hold for reported p-values
    Required for interpreting p < .001 and p = .195

pith-pipeline@v0.9.0 · 5570 in / 1337 out tokens · 52979 ms · 2026-05-10T16:22:30.727270+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. "It depends on where AI is used": Players' attitude patterns and evaluative logics toward different AI applications in digital games

    cs.HC 2026-04 unverdicted novelty 5.0

    Player acceptance of AI in digital games depends on the specific application context, with positive views when it enhances immersion and efficiency but negative views when it undermines creativity, autonomy, or fairness.

Reference graph

Works this paper leans on

40 extracted references · 40 canonical work pages · cited by 1 Pith paper

  1. [1]

    Throughout the entire trial process, I often focused on interacting with the game world and characters

  2. [2]

    During the game, I pay less attention to unrelated things around me

  3. [3]

    Overall, this way of interacting with NPCs makes it easier for me to enter the game state

  4. [4]

    I think I am capable of handling the vast majority of tasks in this trial

  5. [5]

    I think I performed well in completing these tasks

  6. [6]

    not knowing what to do

    This way of interacting with NPCs did not keep me in a state of “not knowing what to do” for a long time. Expression cost dimensions

  7. [7]

    When interacting with NPCs, I need to spend a lot of effort thinking or organizing the content I want to input/express

  8. [8]

    This way of interacting with NPCs makes me feel like expressing myself is a burden

  9. [9]

    how to reply to NPCs

    I often need to think extra about “how to reply to NPCs” instead of just “what to accomplish in the game”. Autonomy dimension

  10. [10]

    I feel like I can push forward the interaction with NPCs according to my own ideas

  11. [11]

    I am able to express my intentions to NPCs in the way I want without being heavily restricted by the system

  12. [12]

    Dimension of Presence

    Overall, I think this way of interacting with NPCs gives me a lot of freedom. Dimension of Presence

  13. [13]

    real, responsive character

    I often feel like I'm interacting with a “real, responsive character”

  14. [14]

    The response of the characters in the game made me feel like they were “present”

  15. [15]

    Response uncertainty dimensions

    When interacting with in-game characters, I can feel their perception and response to my expression. Response uncertainty dimensions

  16. [16]

    When interacting with NPCs, I am not sure how to express myself in order to consistently receive the expected response

  17. [17]

    Target clarity dimension

    In some interactions with NPCs, I am not sure if the character truly understands my intentions. Target clarity dimension

  18. [18]

    In most modules, I am clear about what I need to accomplish

  19. [19]

    System usability dimension

    In the game, I am able to understand the direction of progress for different dialogue and interactive tasks. System usability dimension

  20. [20]

    I think this way of interacting with NPCs is generally easier to get started with

  21. [21]

    I usually know quickly how to communicate with characters in the game

  22. [22]

    The overall process of interacting with NPCs in this way is smooth

  23. [23]

    Trust dimension

    From an operational perspective, this way of interacting with NPCs does not impose too much additional burden on me. Trust dimension

  24. [24]

    I believe that NPCs in the game can usually understand my expression and provide useful responses

  25. [25]

    Preference/Satisfaction

    I believe that the dialogue performance of NPCs in the game is reliable enough. Preference/Satisfaction

  26. [26]

    Compared to another way of interaction, if there is a similar way of interacting with NPCs in other games, I prefer to rely on it to complete tasks

  27. [27]

    Open-ended questions

    I think this NPC interaction method has overall improved my gaming experience. Open-ended questions

  28. [28]

    Which of the seven task modules do you think is the most suitable for interacting with NPCs in this way? ____

  29. [29]

    What do you think is the reason why this module is most suitable for interacting with NPCs among the seven task modules? ____

  30. [30]

    Which of the seven task modules do you think is the least suitable for interacting with NPCs in this way? ____

  31. [31]

    What do you think is the reason why this module is the least suitable for interacting with NPCs among the seven task modules? ____

  32. [32]

    Which of the seven modules would you like to experience again? ____

  33. [33]

    Among the seven modules, what is the reason why you are willing to experience this module again? ______

  34. [34]

    At which moments in the game do you feel like you are truly interacting with an NPC? ____

  35. [35]

    At which moments in the game are you most confused or unsure of what to do? ____

  36. [36]

    What do you think is the most important point for improvement in this way of interacting with NPCs? ____

  37. [37]

    natural language interaction

    What do you think are the most remarkable advantages and disadvantages of comparing “natural language interaction” and “preset option interaction”? ____ Appendix D. Coding Manual

  38. [38]

    smallest meaningful unit in a single answer

    Encoding method This study was initially coded independently by two researchers, and one researcher was responsible for coordinating disagreements and final approval. All coding personnel receive unified training before formal coding, familiarizing themselves with research questions, questionnaire structures, definitions of seven types of task modules, an...

  39. [39]

    Deal with missing content that is clearly invalid (such as answering with 'none')

    Encoding process The coding process of this study is divided into five steps: (1) Material organization: The researcher first exports all open-ended answer texts and organizes them by question number, group (LLM-NPC group × traditional NPC group), and corresponding module. Deal with missing content that is clearly invalid (such as answering with 'none'). ...

  40. [40]

    Reduce thinking burden: Refers to not having to organize language extensively, reducing thinking pressure, and lowering cognitive consumption

    Example of Core Theme Definition The following topics are typical categories that frequently appear in the open feedback of this study, serving as important references for formal coding: (1) Reasons for suitability/unsuitability of a certain interaction method The reasoning process is clearer: preset options help to organize information, clarify logic, an...