Human-AI Collaborative Game Testing with Vision Language Models

Boran Zhang; Muhan Xu; Zhijun Pan

arxiv: 2501.11782 · v2 · submitted 2025-01-20 · 💻 cs.HC · cs.AI

Pith reviewed 2026-05-23 05:03 UTC · model grok-4.3

keywords Human-AI collaboration · Game testing · Vision language models · Defect detection · AI-assisted workflows · Human performance · Video game quality assurance

The pith

AI assistance with vision language models significantly improves human defect identification in game testing, especially when paired with detailed knowledge.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper tests whether AI using vision language models can help human game testers identify defects more accurately. An experiment with 276 participants across 800 test cases compares performance with and without AI support, and with and without detailed knowledge of defects and design. AI assistance leads to better defect identification overall, with the strongest effect when combined with detailed knowledge. However, when the AI makes mistakes, it can negatively influence human decisions. These findings point to the need for careful design of human-AI collaboration in testing processes to maximize benefits while minimizing risks from AI errors.

Core claim

The authors develop an AI-assisted workflow for game testing that uses vision language models to detect defects. In a controlled experiment involving 800 test cases and 276 participants of varying backgrounds, they compare four conditions that vary the presence of AI support and of detailed defect and design knowledge. The results show that AI assistance significantly improves defect identification performance, particularly when paired with detailed knowledge. When the AI makes errors, however, human decision-making suffers. The study concludes that optimizing human-AI collaboration and mitigating AI inaccuracies are important for improving efficiency and accuracy in game testing.

What carries the argument

The AI-assisted workflow leveraging vision language models for defect detection, evaluated through a four-condition experiment on human testers.
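
The four conditions form a 2×2 grid: AI support (present or absent) crossed with detailed knowledge (present or absent). The sketch below only illustrates that structure. It assumes balanced random assignment, which the abstract does not describe, and the names `CONDITIONS` and `assign_conditions` are hypothetical.

```python
import itertools
import random
from collections import Counter

# The two binary factors named in the abstract.
AI_SUPPORT = ("no_ai", "ai")
KNOWLEDGE = ("no_detailed_knowledge", "detailed_knowledge")

# Crossing the factors yields the four experimental conditions.
CONDITIONS = list(itertools.product(AI_SUPPORT, KNOWLEDGE))


def assign_conditions(participant_ids, seed=0):
    """Randomly assign participants to conditions in near-equal groups.

    Hypothetical helper: balanced assignment is an assumption of this
    sketch, not a procedure reported in the source.
    """
    rng = random.Random(seed)
    shuffled = list(participant_ids)
    rng.shuffle(shuffled)
    # Deal the shuffled participants round-robin across the four cells.
    return {pid: CONDITIONS[i % len(CONDITIONS)] for i, pid in enumerate(shuffled)}


assignment = assign_conditions(range(276))
group_sizes = Counter(assignment.values())
```

Under this assumption, 276 participants would split into four groups of 69. The actual group sizes in the study may differ.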

If this is right

Human testers identify more defects when given AI assistance than without it.
The largest performance gains occur when AI support is combined with detailed defect and design knowledge.
AI errors can reduce the accuracy of human decisions in defect identification.
Strategies to mitigate the effects of AI inaccuracies are required for reliable human-AI collaboration.
AI integration can enhance efficiency and accuracy in game testing workflows.
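
One way to examine the claim about AI errors is to split human accuracy by whether the AI suggestion on each case was correct. The following is a minimal sketch under stated assumptions: the record format, the function name `human_accuracy_by_ai_outcome`, and the toy log are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def human_accuracy_by_ai_outcome(records):
    """Split human accuracy by whether the AI suggestion was correct.

    `records` is an iterable of (ai_was_correct, human_was_correct) pairs,
    a hypothetical log format chosen for this sketch.
    """
    tallies = defaultdict(lambda: [0, 0])  # key -> [human correct, total]
    for ai_ok, human_ok in records:
        tallies[bool(ai_ok)][0] += int(bool(human_ok))
        tallies[bool(ai_ok)][1] += 1
    return {key: correct / total for key, (correct, total) in tallies.items()}


# Toy log for illustration only; these are not data from the study.
toy_log = [(True, True), (True, True), (True, False), (False, False), (False, True)]
rates = human_accuracy_by_ai_outcome(toy_log)
```

A markedly lower rate under `False` than under `True` would be consistent with testers following incorrect AI suggestions, though a real analysis would also need to account for case difficulty.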

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same workflow could be tested in non-game software testing where visual elements are inspected for defects.
Adding AI confidence scores or override prompts to the interface might reduce the negative impact of AI mistakes on humans.
Companies using this approach would likely need new training protocols so testers learn to question AI outputs.
Providing design documentation alongside AI tools may become a standard practice in collaborative testing setups.
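
The confidence-score idea above could look roughly like the sketch below. Everything here is an assumption made for illustration: the `AIVerdict` type, the 0.8 threshold, and the wording of the messages are hypothetical and are not part of the paper's workflow.

```python
from dataclasses import dataclass


@dataclass
class AIVerdict:
    """Hypothetical container for one AI judgement on a test case."""
    has_defect: bool
    confidence: float  # assumed to lie in [0, 1]


def present_to_tester(verdict: AIVerdict, threshold: float = 0.8) -> str:
    """Return the message shown to the human tester.

    Below the (assumed) threshold the verdict is framed as tentative and
    the tester is asked to decide independently.
    """
    if not 0.0 <= verdict.confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    label = "defect" if verdict.has_defect else "no defect"
    if verdict.confidence >= threshold:
        return f"AI suggests: {label} (confidence {verdict.confidence:.0%})"
    return (
        f"AI is unsure (leans {label}, confidence {verdict.confidence:.0%}). "
        "Please judge this case yourself first."
    )
```

Whether such a prompt actually reduces over-reliance on AI output is an open question that would need its own experiment.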

Load-bearing premise

The 800 test cases and four experimental conditions sufficiently isolate the effect of AI assistance without major confounding from participant skill variation or test case selection bias.

What would settle it

A replication experiment using a new participant group and different test cases that finds no significant improvement in defect identification rates from AI assistance would disprove the central claim.

Figures

Figures reproduced from arXiv: 2501.11782 by Boran Zhang, Muhan Xu, Zhijun Pan.

**Figure 2.** AI-Assisted Game Testing Workflow (Simplified)

**Figure 3.** Test accuracy distribution. This image illustrates the maximum, …

Original abstract

As modern video games become increasingly complex, traditional manual testing methods are proving costly and inefficient, limiting the ability to ensure high-quality game experiences. While advancements in Artificial Intelligence (AI) offer the potential to assist human testers, the effectiveness of AI in truly enhancing real-world human performance remains underexplored. This study investigates how AI can improve game testing by developing and experimenting with an AI-assisted workflow that leverages state-of-the-art machine learning models for defect detection. Through an experiment involving 800 test cases and 276 participants of varying backgrounds, we evaluate the effectiveness of AI assistance under four conditions: with or without AI support, and with or without detailed knowledge of defects and design documentation. The results indicate that AI assistance significantly improves defect identification performance, particularly when paired with detailed knowledge. However, challenges arise when AI errors occur, negatively impacting human decision-making. Our findings show the importance of optimizing human-AI collaboration and implementing strategies to mitigate the effects of AI inaccuracies. Through this research, we demonstrate AI's potential and problems in enhancing efficiency and accuracy in game testing workflows and offer practical insights for integrating AI into the testing process.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 0 minor

Summary. The manuscript describes an empirical user study on human-AI collaboration for video game defect detection using vision language models. It reports a 2×2 between-subjects experiment with 276 participants of varying backgrounds and 800 test cases across four conditions (AI support present/absent crossed with detailed defect/design knowledge present/absent). The central claim is that AI assistance significantly improves defect identification performance, especially when paired with detailed knowledge, while also noting that AI errors can negatively affect human decisions.

Significance. If the reported performance gains prove robust after proper statistical controls, the work would offer concrete, domain-specific evidence on the benefits and risks of human-AI collaboration in game testing—an area of growing practical importance in HCI and software quality assurance. The inclusion of an AI-error condition and discussion of mitigation strategies is a strength that moves beyond purely positive framing.

major comments (3)

[Abstract] Abstract: the headline claim that 'AI assistance significantly improves defect identification performance' is presented without any statistical details (p-values, effect sizes, confidence intervals, error bars, or power analysis), rendering the central empirical result unverifiable from the provided information.
[Abstract] Abstract (experiment description): the 2×2 design with 276 participants and 800 test cases supplies no information on (a) randomization or balancing of participants across conditions, (b) measurement or stratification by pre-existing testing skill, or (c) sampling/stratification of the 800 test cases; these omissions directly threaten attribution of any observed lift to AI assistance rather than confounds.
[Abstract] Abstract: the results mention 'challenges arise when AI errors occur' but provide no definition of defects, no breakdown of how AI errors were identified or quantified, and no analysis of their impact on the performance metrics, leaving the handling of the negative case unexamined.
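
The statistics the first comment asks for can be illustrated with a small calculation: a difference in accuracy between two conditions, an approximate 95% Wald confidence interval, and Cohen's h as an effect size. This is a sketch only. The function name `compare_accuracy` is hypothetical, the counts are placeholders rather than results from the paper, and the calculation treats every response as independent, which ignores repeated responses from the same participant.

```python
import math


def compare_accuracy(correct_a, n_a, correct_b, n_b, z=1.96):
    """Return (difference b - a, Wald confidence interval, Cohen's h)."""
    p_a, p_b = correct_a / n_a, correct_b / n_b
    diff = p_b - p_a
    # Standard error of a difference between two independent proportions.
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    ci = (diff - z * se, diff + z * se)
    # Cohen's h: difference of arcsine-transformed proportions.
    cohens_h = 2 * math.asin(math.sqrt(p_b)) - 2 * math.asin(math.sqrt(p_a))
    return diff, ci, cohens_h


# Placeholder counts for illustration only; NOT results from the paper.
diff, ci, h = compare_accuracy(correct_a=120, n_a=200, correct_b=150, n_b=200)
```

For data with many responses per participant, a model that accounts for that clustering (for example, a mixed-effects regression) would likely be more appropriate than this independent-samples approximation.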

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive comments on the abstract. We address each point below and will revise the abstract in the next version to improve clarity and verifiability while preserving conciseness.

Referee: [Abstract] Abstract: the headline claim that 'AI assistance significantly improves defect identification performance' is presented without any statistical details (p-values, effect sizes, confidence intervals, error bars, or power analysis), rendering the central empirical result unverifiable from the provided information.

Authors: We agree the abstract should include key statistical details. The full manuscript reports these in the Results section from the 2×2 experiment (including p-values, effect sizes, and confidence intervals from appropriate tests such as ANOVA or regression models). We will revise the abstract to add a concise summary of the main statistical findings supporting the headline claim.

Revision: yes

Referee: [Abstract] Abstract (experiment description): the 2×2 design with 276 participants and 800 test cases supplies no information on (a) randomization or balancing of participants across conditions, (b) measurement or stratification by pre-existing testing skill, or (c) sampling/stratification of the 800 test cases; these omissions directly threaten attribution of any observed lift to AI assistance rather than confounds.

Authors: The Methods section of the full manuscript details random assignment of participants to the four conditions with balancing on background variables, pre-experiment measurement of testing experience for stratification/control, and the sampling procedure for the 800 test cases (covering diverse defect categories). We acknowledge the abstract omits these controls. We will add a brief clause to the abstract summarizing the randomization, balancing, and sampling approach.

Revision: yes

Referee: [Abstract] Abstract: the results mention 'challenges arise when AI errors occur' but provide no definition of defects, no breakdown of how AI errors were identified or quantified, and no analysis of their impact on the performance metrics, leaving the handling of the negative case unexamined.

Authors: The full manuscript defines defects in the Methods, describes the AI-error condition (including how erroneous outputs were generated and quantified), and analyzes their impact on human decisions in the Results. We will revise the abstract to briefly note the defect definition and the analysis of AI-error effects.

Revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical user study with direct measurements

The paper reports results from a 2×2 between-subjects experiment (276 participants, 800 test cases) comparing AI assistance and knowledge conditions on defect identification. No equations, derivations, fitted parameters, or predictive models are presented. Claims rest on observed performance differences across conditions, not on any self-referential construction or self-citation chain. The reader's assessment of 0.0 circularity is confirmed by the absence of any load-bearing mathematical or definitional steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that the chosen test cases and participant pool are representative of real game testing; no free parameters or invented entities are introduced.

axioms (1)

Domain assumption: The selected 800 test cases and 276 participants of varying backgrounds represent typical game testing scenarios and allow generalization of AI assistance effects.
Invoked to support claims about real-world workflow improvements from the four-condition experiment.


[1] [1]

Levy and J

L. Levy and J. Novak, Game development essentials: Game QA & testing. Delmar Learning, 2009

work page 2009

[2] [2]

Video game values: Human–computer interaction and games,

P. Barr, J. Noble, and R. Biddle, “Video game values: Human–computer interaction and games,” Interacting with computers , vol. 19, no. 2, pp. 180–195, 2007

work page 2007

[3] [3]

Ag3: Automated game gui text glitch detection based on computer vision,

X. Liang, J. Qi, Y . Gao, C. Peng, and P. Yang, “Ag3: Automated game gui text glitch detection based on computer vision,” in Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering , 2023, pp. 1879–1890

work page 2023

[4] [4]

The consolidation of game software engineering: A systematic literature review of soft- ware engineering for industry-scale computer games,

J. Chueca, J. Ver ´on, J. Font, F. P ´erez, and C. Cetina, “The consolidation of game software engineering: A systematic literature review of soft- ware engineering for industry-scale computer games,” Information and Software Technology, vol. 165, p. 107330, 2024

work page 2024

[5] [5]

A survey of video game testing,

C. Politowski, F. Petrillo, and Y .-G. Gu´eh´eneuc, “A survey of video game testing,” in 2021 IEEE/ACM International Conference on Automation of Software Test (AST). IEEE, 2021, pp. 90–99

work page 2021

[6] [6]

Automated game testing with icarus: Intelligent completion of adventure riddles via unsupervised solving,

J. Pfau, J. D. Smeddinck, and R. Malaka, “Automated game testing with icarus: Intelligent completion of adventure riddles via unsupervised solving,” in Extended abstracts publication of the annual symposium on computer-human interaction in play , 2017, pp. 153–164

work page 2017

[7] [7]

Quantizing large- language models for predicting flaky tests,

S. Rahman, A. Baz, S. Misailovic, and A. Shi, “Quantizing large- language models for predicting flaky tests,” in 2024 IEEE Conference on Software Testing, Verification and Validation (ICST) . IEEE, 2024, pp. 93–104

work page 2024

[8] [8]

A new approach in development of distributed framework for automated software testing using agents,

P. Dhavachelvan, G. Uma, and V . Venkatachalapathy, “A new approach in development of distributed framework for automated software testing using agents,” Knowledge-Based Systems, vol. 19, no. 4, pp. 235–247, 2006

work page 2006

[9] [9]

Development of game testing method for measuring game quality,

R. Ramadan and B. Hendradjaya, “Development of game testing method for measuring game quality,” in 2014 International Conference on Data and Software Engineering (ICODSE) . IEEE, 2014, pp. 1–6

work page 2014

[10] [10]

A video game testing method utilizing deep learning,

M. R. Taesiri, M. Habibi, and M. A. Fazli, “A video game testing method utilizing deep learning,” Iran Journal of Computer Science , vol. 17, no. 2, 2020

work page 2020

[11] [11]

Machine learning applied to software testing: A systematic mapping study,

V . H. Durelli, R. S. Durelli, S. S. Borges, A. T. Endo, M. M. Eler, D. R. Dias, and M. P. Guimar˜aes, “Machine learning applied to software testing: A systematic mapping study,” IEEE Transactions on Reliability, vol. 68, no. 3, pp. 1189–1212, 2019

work page 2019

[12] [12]

Automated game testing using computer vision methods,

C. Paduraru, M. Paduraru, and A. Stefanescu, “Automated game testing using computer vision methods,” in 2021 36th IEEE/ACM International Conference on Automated Software Engineering Workshops (ASEW) . IEEE, 2021, pp. 65–72

work page 2021

[13] [13]

Astrobug: Automatic game bug detection using deep learning,

E. Azizi and L. Zaman, “Astrobug: Automatic game bug detection using deep learning,” IEEE Transactions on Games , 2024

work page 2024

[14] C. Paduraru, M. Cernat, and A. Stefanescu, “Automated evaluation of game content display using deep learning,” in Proceedings of the 28th International Conference on Evaluation and Assessment in Software Engineering, 2024, pp. 421–424

[15] G. Liu, M. Cai, L. Zhao, T. Qin, A. Brown, J. Bischoff, and T.-Y. Liu, “Inspector: Pixel-based automated game testing via exploration, detection, and investigation,” in 2022 IEEE Conference on Games (CoG). IEEE, 2022, pp. 237–244

[16] A. Senchenko, N. Patterson, H. Samuel, and D. Ispir, “Supernova: Automating test selection and defect prevention in aaa video games using risk based testing and machine learning,” in 2022 IEEE Conference on Software Testing, Verification and Validation (ICST). IEEE, 2022, pp. 345–354

[17] A. Arcuri, “Restful api automated test case generation,” in 2017 IEEE International Conference on Software Quality, Reliability and Security (QRS). IEEE, 2017, pp. 9–20

[18] S. Ariyurek, A. Betin-Can, and E. Surer, “Automated video game testing using synthetic and humanlike agents,” IEEE Transactions on Games, vol. 13, no. 1, pp. 50–67, 2019

[19] Y. K. Dwivedi, L. Hughes, E. Ismagilova, G. Aarts, C. Coombs, T. Crick, Y. Duan, R. Dwivedi, J. Edwards, A. Eirug et al., “Artificial intelligence (ai): Multidisciplinary perspectives on emerging challenges, opportunities, and agenda for research, practice and policy,” International journal of information management, vol. 57, p. 101994, 2021

[20] F. Cabitza, C. Fregosi, A. Campagner, and C. Natali, “Explanations considered harmful: The impact of misleading explanations on accuracy in hybrid human-ai decision making,” in World Conference on Explainable Artificial Intelligence. Springer, 2024, pp. 255–269

[21] W. Xu, M. J. Dainoff, L. Ge, and Z. Gao, “From human-computer interaction to human-ai interaction: new challenges and opportunities for enabling human-centered ai,” arXiv preprint arXiv:2105.05424, vol. 5, 2021

[22] V. Rawte, A. Sheth, and A. Das, “A survey of hallucination in large foundation models,” arXiv preprint arXiv:2309.05922, 2023

[23] Y. Zhang, Y. Li, L. Cui, D. Cai, L. Liu, T. Fu, X. Huang, E. Zhao, Y. Zhang, Y. Chen et al., “Siren’s song in the ai ocean: a survey on hallucination in large language models,” arXiv preprint arXiv:2309.01219, 2023

[24] W. Sun, Z. Shi, S. Gao, P. Ren, M. de Rijke, and Z. Ren, “Contrastive learning reduces hallucination in conversations,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, no. 11, 2023, pp. 13 618–13 626

[25] X. Chen, L. Wang, W. Wu, Q. Tang, and Y. Liu, “Honest ai: Fine-tuning ‘small’ language models to say ‘I don’t know’, and reducing hallucination in rag,” in 2024 KDD Cup Workshop for Retrieval Augmented Generation, 2024

[26] S. N. Stahlke and P. Mirza-Babaei, “Usertesting without the user: Opportunities and challenges of an ai-driven approach in games user research,” Computers in Entertainment (CIE), vol. 16, no. 2, pp. 1–18, 2018

[27] S. Jintawatsakoon, “Visual glitches classification for video game using deep learning-based techniques,” in 2023 27th International Computer Science and Engineering Conference (ICSEC). IEEE, 2023, pp. 116–121

[28] A. Sestini, L. Gisslén, J. Bergdahl, K. Tollmar, and A. D. Bagdanov, “Automated gameplay testing and validation with curiosity-conditioned proximal trajectories,” IEEE Transactions on Games, vol. 16, no. 1, pp. 113–126, 2022

[29] G. Meszaros, xUnit test patterns: Refactoring test code. Pearson Education, 2007

[30] OpenAI, “Gpt-4o: Openai platform documentation,” 2024, accessed: 2024-09-30. [Online]. Available: https://platform.openai.com/docs/models/gpt-4o

[31] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778

[32] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” Communications of the ACM, vol. 60, no. 6, pp. 84–90, 2017

[33] I. Prasetya, F. Pastor Ricós, F. M. Kifetew, D. Prandi, S. Shirzadehhajimahmood, T. E. Vos, P. Paska, K. Hovorka, R. Ferdous, A. Susi et al., “An agent-based approach to automated game testing: an experience report,” in Proceedings of the 13th International Workshop on Automating Test Case Design, Selection and Evaluation, 2022, pp. 1–8

[34] K. Chen, Y. Li, Y. Chen, C. Fan, Z. Hu, and W. Yang, “Glib: towards automated test oracle for graphically-rich applications,” in Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2021, pp. 1093–1104

[35] D. Norman, The design of everyday things: Revised and expanded edition. Basic books, 2013

[36] B. Shneiderman, C. Plaisant, M. S. Cohen, S. Jacobs, N. Elmqvist, and N. Diakopoulos, Designing the User Interface: Strategies for Effective Human-Computer Interaction. Pearson, 2016

[37] A. Dix, Human-computer interaction. Pearson Education, 2004

[38] R. Hartson and P. S. Pyla, The UX book: Agile UX design for a quality user experience. Morgan Kaufmann, 2018

[39] J. Lazar, J. H. Feng, and H. Hochheiser, Research methods in human-computer interaction. Morgan Kaufmann, 2017

[40] K. Vodrahalli, T. Gerstenberg, and J. Y. Zou, “Uncalibrated models can improve human-ai collaboration,” Advances in Neural Information Processing Systems, vol. 35, pp. 4004–4016, 2022

[41] T. Boyacı, C. Canyakmaz, and F. de Véricourt, “Human and machine: The impact of machine input on decision making under cognitive limitations,” Management Science, vol. 70, no. 2, pp. 1258–1275, 2024

[42] B. Gebru, L. Zeleke, D. Blankson, M. Nabil, S. Nateghi, A. Homaifar, and E. Tunstel, “A review on human–machine trust evaluation: Human-centric and machine-centric perspectives,” IEEE Transactions on Human-Machine Systems, vol. 52, no. 5, pp. 952–962, 2022

[43] Z. Zhang, J. Singh, U. Gadiraju, and A. Anand, “Dissonance between human and machine understanding,” Proceedings of the ACM on Human-Computer Interaction, vol. 3, no. CSCW, pp. 1–23, 2019

[44] P. Harms, “Automated usability evaluation of virtual reality applications,” ACM Transactions on Computer-Human Interaction (TOCHI), vol. 26, no. 3, pp. 1–36, 2019

[45] S. Varvaressos, K. Lavoie, S. Gaboury, and S. Hallé, “Automated bug finding in video games: A case study for runtime monitoring,” Computers in Entertainment (CIE), vol. 15, no. 1, pp. 1–28, 2017

[46] G. Rani, U. Pandey, A. A. Wagde, and V. S. Dhaka, “A deep reinforcement learning technique for bug detection in video games,” International Journal of Information Technology, vol. 15, no. 1, pp. 355–367, 2023

[47] A. Albaghajati and M. Ahmed, “Video game automated testing approaches: An assessment framework,” IEEE transactions on games, vol. 15, no. 1, pp. 81–94, 2020

[48] A. Nantes, R. Brown, and F. Maire, “A framework for the semi-automatic testing of video games,” in Proceedings of the AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment, vol. 4, no. 1, 2008, pp. 197–202

[49] C. Paduraru, M. Paduraru, and A. Stefanescu, “Rivergame-a game testing tool using artificial intelligence,” in 2022 IEEE Conference on Software Testing, Verification and Validation (ICST). IEEE, 2022, pp. 422–432

[50] N. Ballou, A. Denisova, R. Ryan, C. S. Rigby, and S. Deterding, “The basic needs in games scale (bangs): A new tool for investigating positive and negative video game experiences,” International Journal of Human-Computer Studies, vol. 188, p. 103289, 2024

[51] C. P. Schultz and R. D. Bryant, Game testing: All in one. Mercury Learning and Information, 2016

[52] R. Coppola, T. Fulcini, and F. Strada, “Know your bugs: A survey of issues in automated game testing literature,” in 2024 IEEE Gaming, Entertainment, and Media Conference (GEM). IEEE, 2024, pp. 1–6

[53] B. Jiang, W. Wei, L. Yi, and W. Chan, “Droidgamer: Android game testing with operable widget recognition by deep learning,” in 2021 IEEE 21st International Conference on Software Quality, Reliability and Security (QRS). IEEE, 2021, pp. 197–206

[54] M. Andrejevic, Infoglut: How too much information is changing the way we think and know. Routledge, 2013

[55] A. Zou, W. Yu, H. Zhang, K. Ma, D. Cai, Z. Zhang, H. Zhao, and D. Yu, “Docbench: A benchmark for evaluating llm-based document reading systems,” arXiv preprint arXiv:2407.10701, 2024