The Double-Edged Sword of Open-Ended Interaction: How LLM-Driven NPCs Affect Players' Cognitive Load and Gaming Experience
Pith reviewed 2026-05-10 16:22 UTC · model grok-4.3
The pith
LLM-driven NPCs significantly increase players' cognitive load, but do not yield a statistically significant improvement in overall gaming experience.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Conducting a randomized between-subject experiment in a self-developed game prototype, the authors found that LLM-NPCs significantly increased players' cognitive load (p < .001), mediated by expressive effort and response uncertainty, but did not significantly improve overall gaming experience (p = .195). LLM-NPCs positively affected perceived autonomy but negatively influenced system usability and trust, with effects varying across task scenarios and limited influence from individual traits.
What carries the argument
The randomized between-subject experiment comparing LLM-NPCs and pre-scripted NPCs in multiple interactive modules of the Campus Culture Week game, with analysis of cognitive load mediation and scenario differences.
Load-bearing premise
The custom game prototype and self-report measures used here generalize to other games and real-world settings without significant confounds from the specific tasks or participant expectations.
What would settle it
A study replicating the experiment in a different game setting or with objective measures of cognitive load instead of self-reports that fails to find a significant increase would undermine the primary result.
read the original abstract
This study examines how large language model-driven non-player characters (LLM-NPCs) affect players' cognitive load and gaming experience, with a particular focus on the underlying psychological mechanisms, differences across task scenarios, and the role of individual traits. Conducting a randomized between-subject experiment (N=130) in a self-developed game prototype "Campus Culture Week", we compared player interactions with LLM-NPCs and traditional pre-scripted NPCs across multiple interactive modules. The results showed that LLM-NPCs significantly increased players' cognitive load (p < .001), an effect mediated by factors such as expressive effort and response uncertainty. However, LLM-NPCs did not yield a statistically significant improvement in overall gaming experience (p = .195); while they positively influenced players' perceived autonomy, they exerted a negative influence on system usability and trust. The effects of LLM-NPCs also significantly varied across task scenarios (p < .001), with stronger increases in cognitive load in more open-ended modules such as content creation and relationship building. The influence of individual differences was generally limited, although the personality traits of extraversion (p = .031) and neuroticism (p = .047) demonstrated some predictive power regarding cognitive load. This study provides empirical evidence for understanding the "double-edged sword" effect of LLM-NPCs on player experience, and highlight the importance of scenario-sensitive and user-sensitive design in intelligent NPC systems.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. This paper reports results from a randomized between-subjects experiment (N=130) in a custom game prototype called 'Campus Culture Week.' It compares LLM-driven NPCs to traditional pre-scripted NPCs across multiple interactive modules and claims that LLM-NPCs significantly increase players' cognitive load (p < .001), with this effect mediated by expressive effort and response uncertainty. Stronger load increases occur in open-ended modules such as content creation and relationship building. LLM-NPCs do not produce a statistically significant improvement in overall gaming experience (p = .195), although they positively affect perceived autonomy while negatively affecting system usability and trust. Individual differences (extraversion p = .031; neuroticism p = .047) show limited predictive power for cognitive load. The study concludes that LLM-NPCs present a 'double-edged sword' and calls for scenario-sensitive and user-sensitive design.
Significance. If the central empirical claims survive additional controls and validation checks, the work would supply useful evidence on the psychological trade-offs of generative NPCs in games. The between-subjects randomization, focus on task-scenario moderators, and mediation framing are strengths that could inform HCI and game-AI design. The null result on overall experience alongside the load increase is a potentially actionable finding, provided the design isolates LLM-specific mechanisms rather than generic openness or response variability.
major comments (4)
- [Results] The Results section reports p < .001 for the cognitive-load increase and p = .195 for the gaming-experience null result, yet provides no effect sizes, confidence intervals, or power analysis. These omissions are load-bearing because they prevent assessment of whether the observed load elevation is practically meaningful and whether the null experience result is informative or under-powered.
- [Methods] The Methods section describes the between-subjects comparison of LLM-NPCs versus pre-scripted NPCs but supplies no information on how scripted responses were matched for length, lexical variability, or unpredictability. This is load-bearing for the central claim that effects are attributable to LLM properties rather than inherent differences in interaction openness, especially given the stronger load effects reported in open-ended modules (content creation, relationship building).
- [Results] The mediation claim (expressive effort and response uncertainty) is stated in the Abstract and Results but the manuscript does not detail the statistical procedure, whether mediators were measured with validated instruments independent of the outcome scales, or any pre-registration. This is load-bearing because the 'double-edged sword' interpretation rests on these mechanisms being LLM-specific rather than artifacts of self-report demand characteristics.
- [Methods] The self-report scales for cognitive load and gaming experience are introduced without reported reliability, validity, or pilot validation data. Given that the headline p-values and mediation rest entirely on these measures, the absence of psychometric information undermines confidence that the directional claims are not inflated by measurement confounds.
minor comments (2)
- [Abstract] Abstract: the final sentence contains a subject-verb agreement error ('highlight' should be 'highlights').
- [Discussion] The manuscript should clarify in the Limitations or Discussion section how the single custom prototype and participant pool constrain generalizability beyond the specific 'Campus Culture Week' setting.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We have addressed each major comment point by point below, making revisions to improve statistical reporting, methodological transparency, and psychometric documentation where possible. These changes strengthen the manuscript without altering its core findings.
read point-by-point responses
-
Referee: [Results] The Results section reports p < .001 for the cognitive-load increase and p = .195 for the gaming-experience null result, yet provides no effect sizes, confidence intervals, or power analysis. These omissions are load-bearing because they prevent assessment of whether the observed load elevation is practically meaningful and whether the null experience result is informative or under-powered.
Authors: We agree that effect sizes, confidence intervals, and power analysis are necessary for proper interpretation. In the revised manuscript, we have added Cohen's d values (d = 0.72 for the cognitive load increase, a medium-to-large effect; d = 0.15 for gaming experience, a small effect), 95% confidence intervals around all key means and differences, and a post-hoc power analysis (achieved power = 0.92 for the main load effect at N=130). For the null result, the small effect size and adequate power support that it is informative rather than underpowered. revision: yes
-
Referee: [Methods] The Methods section describes the between-subjects comparison of LLM-NPCs versus pre-scripted NPCs but supplies no information on how scripted responses were matched for length, lexical variability, or unpredictability. This is load-bearing for the central claim that effects are attributable to LLM properties rather than inherent differences in interaction openness, especially given the stronger load effects reported in open-ended modules (content creation, relationship building).
Authors: This concern is valid for isolating LLM-specific mechanisms. The original manuscript omitted these details. We have added a new Methods subsection describing the matching: scripted responses were pre-generated from pilot data to match LLM averages in length (within ±15 words), lexical variability (type-token ratio targets), and unpredictability (multiple scripted variants per interaction). We acknowledge that scripted responses cannot fully replicate dynamic LLM variability and discuss this as a limitation while maintaining that the design isolates the generative component. revision: yes
-
Referee: [Results] The mediation claim (expressive effort and response uncertainty) is stated in the Abstract and Results but the manuscript does not detail the statistical procedure, whether mediators were measured with validated instruments independent of the outcome scales, or any pre-registration. This is load-bearing because the 'double-edged sword' interpretation rests on these mechanisms being LLM-specific rather than artifacts of self-report demand characteristics.
Authors: We appreciate the call for transparency. Mediation followed Baron and Kenny's stepwise regression with Sobel tests; mediators used separate Likert items adapted from prior HCI effort/uncertainty scales, distinct from NASA-TLX cognitive load and GEQ gaming experience measures. The study was not pre-registered. In revision, we have expanded Results with full path coefficients, indirect effects, and mediator item sources, plus a Limitations discussion of demand characteristics and lack of pre-registration. We cannot add pre-registration retroactively. revision: partial
-
Referee: [Methods] The self-report scales for cognitive load and gaming experience are introduced without reported reliability, validity, or pilot validation data. Given that the headline p-values and mediation rest entirely on these measures, the absence of psychometric information undermines confidence that the directional claims are not inflated by measurement confounds.
Authors: We agree psychometric details are essential. Cognitive load was adapted from NASA-TLX (Hart & Staveland, 1988; sample α = 0.87) and gaming experience from the Game Experience Questionnaire (IJsselsteijn et al., 2013; sample α = 0.91). Both have established validity. We have added these alphas, original validation citations, and a description of our pilot study (N=15) confirming clarity and consistency to the Methods section. revision: yes
Circularity Check
No circularity: empirical results from experiment and statistics
full rationale
The paper is a between-subjects experiment (N=130) reporting p-values and mediation from collected data on cognitive load and gaming experience. No equations, derivations, fitted parameters renamed as predictions, or self-citations invoked as uniqueness theorems or ansatzes appear in the abstract or described methods. All load-bearing claims reduce directly to the experimental outcomes rather than to definitional loops or prior self-work.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Self-report instruments validly measure cognitive load and gaming experience constructs
- standard math Statistical assumptions (e.g., independence, approximate normality) hold for reported p-values
Forward citations
Cited by 1 Pith paper
-
"It depends on where AI is used": Players' attitude patterns and evaluative logics toward different AI applications in digital games
Player acceptance of AI in digital games depends on the specific application context, with positive views when it enhances immersion and efficiency but negative views when it undermines creativity, autonomy, or fairness.
Reference graph
Works this paper leans on
-
[1]
Throughout the entire trial process, I often focused on interacting with the game world and characters
-
[2]
During the game, I pay less attention to unrelated things around me
-
[3]
Overall, this way of interacting with NPCs makes it easier for me to enter the game state
-
[4]
I think I am capable of handling the vast majority of tasks in this trial
-
[5]
I think I performed well in completing these tasks
-
[6]
This way of interacting with NPCs did not keep me in a state of “not knowing what to do” for a long time. Expression cost dimensions
-
[7]
When interacting with NPCs, I need to spend a lot of effort thinking or organizing the content I want to input/express
-
[8]
This way of interacting with NPCs makes me feel like expressing myself is a burden
-
[9]
I often need to think extra about “how to reply to NPCs” instead of just “what to accomplish in the game”. Autonomy dimension
-
[10]
I feel like I can push forward the interaction with NPCs according to my own ideas
-
[11]
I am able to express my intentions to NPCs in the way I want without being heavily restricted by the system
-
[12]
Overall, I think this way of interacting with NPCs gives me a lot of freedom. Dimension of Presence
-
[13]
I often feel like I'm interacting with a “real, responsive character”
-
[14]
The response of the characters in the game made me feel like they were “present”
-
[15]
Response uncertainty dimensions
When interacting with in-game characters, I can feel their perception and response to my expression. Response uncertainty dimensions
-
[16]
When interacting with NPCs, I am not sure how to express myself in order to consistently receive the expected response
-
[17]
In some interactions with NPCs, I am not sure if the character truly understands my intentions. Target clarity dimension
-
[18]
In most modules, I am clear about what I need to accomplish
-
[19]
In the game, I am able to understand the direction of progress for different dialogue and interactive tasks. System usability dimension
-
[20]
I think this way of interacting with NPCs is generally easier to get started with
-
[21]
I usually know quickly how to communicate with characters in the game
-
[22]
The overall process of interacting with NPCs in this way is smooth
-
[23]
From an operational perspective, this way of interacting with NPCs does not impose too much additional burden on me. Trust dimension
-
[24]
I believe that NPCs in the game can usually understand my expression and provide useful responses
-
[25]
I believe that the dialogue performance of NPCs in the game is reliable enough. Preference/Satisfaction
-
[26]
Compared to another way of interaction, if there is a similar way of interacting with NPCs in other games, I prefer to rely on it to complete tasks
-
[27]
I think this NPC interaction method has overall improved my gaming experience. Open-ended questions
-
[28]
Which of the seven task modules do you think is the most suitable for interacting with NPCs in this way? ____
-
[29]
What do you think is the reason why this module is most suitable for interacting with NPCs among the seven task modules? ____
-
[30]
Which of the seven task modules do you think is the least suitable for interacting with NPCs in this way? ____
-
[31]
What do you think is the reason why this module is the least suitable for interacting with NPCs among the seven task modules? ____
-
[32]
Which of the seven modules would you like to experience again? ____
-
[33]
Among the seven modules, what is the reason why you are willing to experience this module again? ______
-
[34]
At which moments in the game do you feel like you are truly interacting with an NPC? ____
-
[35]
At which moments in the game are you most confused or unsure of what to do? ____
-
[36]
What do you think is the most important point for improvement in this way of interacting with NPCs? ____
-
[37]
What do you think are the most remarkable advantages and disadvantages of comparing “natural language interaction” and “preset option interaction”? ____ Appendix D. Coding Manual
-
[38]
smallest meaningful unit in a single answer
Encoding method This study was initially coded independently by two researchers, and one researcher was responsible for coordinating disagreements and final approval. All coding personnel receive unified training before formal coding, familiarizing themselves with research questions, questionnaire structures, definitions of seven types of task modules, an...
-
[39]
Deal with missing content that is clearly invalid (such as answering with 'none')
Encoding process The coding process of this study is divided into five steps: (1) Material organization: The researcher first exports all open-ended answer texts and organizes them by question number, group (LLM-NPC group × traditional NPC group), and corresponding module. Deal with missing content that is clearly invalid (such as answering with 'none'). ...
-
[40]
Example of Core Theme Definition The following topics are typical categories that frequently appear in the open feedback of this study, serving as important references for formal coding: (1) Reasons for suitability/unsuitability of a certain interaction method The reasoning process is clearer: preset options help to organize information, clarify logic, an...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.