Human-AI Collaborative Game Testing with Vision Language Models
Pith reviewed 2026-05-23 05:03 UTC · model grok-4.3
The pith
AI assistance with vision language models significantly improves human defect identification in game testing, especially when paired with detailed knowledge.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors develop an AI-assisted workflow for game testing that uses vision language models to detect defects. Through controlled experiments involving 800 test cases and 276 participants of varying backgrounds, they compare four conditions varying the presence of AI support and detailed defect/design knowledge. The results show that AI assistance significantly improves defect identification performance, particularly when paired with detailed knowledge. Challenges arise when AI errors occur, negatively impacting human decision-making. The study concludes that optimizing human-AI collaboration and mitigating AI inaccuracies is important for enhancing efficiency and accuracy in game testing.
What carries the argument
The AI-assisted workflow leveraging vision language models for defect detection, evaluated through a four-condition experiment on human testers.
If this is right
- Human testers identify more defects when given AI assistance than without it.
- The largest performance gains occur when AI support is combined with detailed defect and design knowledge.
- AI errors can reduce the accuracy of human decisions in defect identification.
- Strategies to mitigate the effects of AI inaccuracies are required for reliable human-AI collaboration.
- AI integration can enhance efficiency and accuracy in game testing workflows.
Where Pith is reading between the lines
- The same workflow could be tested in non-game software testing where visual elements are inspected for defects.
- Adding AI confidence scores or override prompts to the interface might reduce the negative impact of AI mistakes on humans.
- Companies using this approach would likely need new training protocols so testers learn to question AI outputs.
- Providing design documentation alongside AI tools may become a standard practice in collaborative testing setups.
Load-bearing premise
The 800 test cases and four experimental conditions sufficiently isolate the effect of AI assistance without major confounding from participant skill variation or test case selection bias.
What would settle it
A replication experiment using a new participant group and different test cases that finds no significant improvement in defect identification rates from AI assistance would disprove the central claim.
Figures
read the original abstract
As modern video games become increasingly complex, traditional manual testing methods are proving costly and inefficient, limiting the ability to ensure high-quality game experiences. While advancements in Artificial Intelligence (AI) offer the potential to assist human testers, the effectiveness of AI in truly enhancing real-world human performance remains underexplored. This study investigates how AI can improve game testing by developing and experimenting with an AI-assisted workflow that leverages state-of-the-art machine learning models for defect detection. Through an experiment involving 800 test cases and 276 participants of varying backgrounds, we evaluate the effectiveness of AI assistance under four conditions: with or without AI support, and with or without detailed knowledge of defects and design documentation. The results indicate that AI assistance significantly improves defect identification performance, particularly when paired with detailed knowledge. However, challenges arise when AI errors occur, negatively impacting human decision-making. Our findings show the importance of optimizing human-AI collaboration and implementing strategies to mitigate the effects of AI inaccuracies. By this research, we demonstrate AI's potential and problems in enhancing efficiency and accuracy in game testing workflows and offers practical insights for integrating AI into the testing process.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript describes an empirical user study on human-AI collaboration for video game defect detection using vision language models. It reports a 2×2 between-subjects experiment with 276 participants of varying backgrounds and 800 test cases across four conditions (AI support present/absent crossed with detailed defect/design knowledge present/absent). The central claim is that AI assistance significantly improves defect identification performance, especially when paired with detailed knowledge, while also noting that AI errors can negatively affect human decisions.
Significance. If the reported performance gains prove robust after proper statistical controls, the work would offer concrete, domain-specific evidence on the benefits and risks of human-AI collaboration in game testing—an area of growing practical importance in HCI and software quality assurance. The inclusion of an AI-error condition and discussion of mitigation strategies is a strength that moves beyond purely positive framing.
major comments (3)
- [Abstract] Abstract: the headline claim that 'AI assistance significantly improves defect identification performance' is presented without any statistical details (p-values, effect sizes, confidence intervals, error bars, or power analysis), rendering the central empirical result unverifiable from the provided information.
- [Abstract] Abstract (experiment description): the 2×2 design with 276 participants and 800 test cases supplies no information on (a) randomization or balancing of participants across conditions, (b) measurement or stratification by pre-existing testing skill, or (c) sampling/stratification of the 800 test cases; these omissions directly threaten attribution of any observed lift to AI assistance rather than confounds.
- [Abstract] Abstract: the results mention 'challenges arise when AI errors occur' but provide no definition of defects, no breakdown of how AI errors were identified or quantified, and no analysis of their impact on the performance metrics, leaving the handling of the negative case unexamined.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive comments on the abstract. We address each point below and will revise the abstract in the next version to improve clarity and verifiability while preserving conciseness.
read point-by-point responses
-
Referee: [Abstract] Abstract: the headline claim that 'AI assistance significantly improves defect identification performance' is presented without any statistical details (p-values, effect sizes, confidence intervals, error bars, or power analysis), rendering the central empirical result unverifiable from the provided information.
Authors: We agree the abstract should include key statistical details. The full manuscript reports these in the Results section from the 2×2 experiment (including p-values, effect sizes, and confidence intervals from appropriate tests such as ANOVA or regression models). We will revise the abstract to add a concise summary of the main statistical findings supporting the headline claim. revision: yes
-
Referee: [Abstract] Abstract (experiment description): the 2×2 design with 276 participants and 800 test cases supplies no information on (a) randomization or balancing of participants across conditions, (b) measurement or stratification by pre-existing testing skill, or (c) sampling/stratification of the 800 test cases; these omissions directly threaten attribution of any observed lift to AI assistance rather than confounds.
Authors: The Methods section of the full manuscript details random assignment of participants to the four conditions with balancing on background variables, pre-experiment measurement of testing experience for stratification/control, and the sampling procedure for the 800 test cases (covering diverse defect categories). We acknowledge the abstract omits these controls. We will add a brief clause to the abstract summarizing the randomization, balancing, and sampling approach. revision: yes
-
Referee: [Abstract] Abstract: the results mention 'challenges arise when AI errors occur' but provide no definition of defects, no breakdown of how AI errors were identified or quantified, and no analysis of their impact on the performance metrics, leaving the handling of the negative case unexamined.
Authors: The full manuscript defines defects in the Methods, describes the AI-error condition (including how erroneous outputs were generated and quantified), and analyzes their impact on human decisions in the Results. We will revise the abstract to briefly note the defect definition and the analysis of AI-error effects. revision: yes
Circularity Check
No circularity: purely empirical user study with direct measurements
full rationale
The paper reports results from a 2×2 between-subjects experiment (276 participants, 800 test cases) comparing AI assistance and knowledge conditions on defect identification. No equations, derivations, fitted parameters, or predictive models are presented. Claims rest on observed performance differences across conditions, not on any self-referential construction or self-citation chain. The reader's assessment of 0.0 circularity is confirmed by the absence of any load-bearing mathematical or definitional steps.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption The selected 800 test cases and 276 participants of varying backgrounds represent typical game testing scenarios and allow generalization of AI assistance effects.
Reference graph
Works this paper leans on
-
[1]
L. Levy and J. Novak, Game development essentials: Game QA & testing. Delmar Learning, 2009
work page 2009
-
[2]
Video game values: Human–computer interaction and games,
P. Barr, J. Noble, and R. Biddle, “Video game values: Human–computer interaction and games,” Interacting with computers , vol. 19, no. 2, pp. 180–195, 2007
work page 2007
-
[3]
Ag3: Automated game gui text glitch detection based on computer vision,
X. Liang, J. Qi, Y . Gao, C. Peng, and P. Yang, “Ag3: Automated game gui text glitch detection based on computer vision,” in Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering , 2023, pp. 1879–1890
work page 2023
-
[4]
J. Chueca, J. Ver ´on, J. Font, F. P ´erez, and C. Cetina, “The consolidation of game software engineering: A systematic literature review of soft- ware engineering for industry-scale computer games,” Information and Software Technology, vol. 165, p. 107330, 2024
work page 2024
-
[5]
A survey of video game testing,
C. Politowski, F. Petrillo, and Y .-G. Gu´eh´eneuc, “A survey of video game testing,” in 2021 IEEE/ACM International Conference on Automation of Software Test (AST). IEEE, 2021, pp. 90–99
work page 2021
-
[6]
J. Pfau, J. D. Smeddinck, and R. Malaka, “Automated game testing with icarus: Intelligent completion of adventure riddles via unsupervised solving,” in Extended abstracts publication of the annual symposium on computer-human interaction in play , 2017, pp. 153–164
work page 2017
-
[7]
Quantizing large- language models for predicting flaky tests,
S. Rahman, A. Baz, S. Misailovic, and A. Shi, “Quantizing large- language models for predicting flaky tests,” in 2024 IEEE Conference on Software Testing, Verification and Validation (ICST) . IEEE, 2024, pp. 93–104
work page 2024
-
[8]
A new approach in development of distributed framework for automated software testing using agents,
P. Dhavachelvan, G. Uma, and V . Venkatachalapathy, “A new approach in development of distributed framework for automated software testing using agents,” Knowledge-Based Systems, vol. 19, no. 4, pp. 235–247, 2006
work page 2006
-
[9]
Development of game testing method for measuring game quality,
R. Ramadan and B. Hendradjaya, “Development of game testing method for measuring game quality,” in 2014 International Conference on Data and Software Engineering (ICODSE) . IEEE, 2014, pp. 1–6
work page 2014
-
[10]
A video game testing method utilizing deep learning,
M. R. Taesiri, M. Habibi, and M. A. Fazli, “A video game testing method utilizing deep learning,” Iran Journal of Computer Science , vol. 17, no. 2, 2020
work page 2020
-
[11]
Machine learning applied to software testing: A systematic mapping study,
V . H. Durelli, R. S. Durelli, S. S. Borges, A. T. Endo, M. M. Eler, D. R. Dias, and M. P. Guimar˜aes, “Machine learning applied to software testing: A systematic mapping study,” IEEE Transactions on Reliability, vol. 68, no. 3, pp. 1189–1212, 2019
work page 2019
-
[12]
Automated game testing using computer vision methods,
C. Paduraru, M. Paduraru, and A. Stefanescu, “Automated game testing using computer vision methods,” in 2021 36th IEEE/ACM International Conference on Automated Software Engineering Workshops (ASEW) . IEEE, 2021, pp. 65–72
work page 2021
-
[13]
Astrobug: Automatic game bug detection using deep learning,
E. Azizi and L. Zaman, “Astrobug: Automatic game bug detection using deep learning,” IEEE Transactions on Games , 2024
work page 2024
-
[14]
Automated evaluation of game content display using deep learning,
C. Paduraru, M. Cernat, and A. Stefanescu, “Automated evaluation of game content display using deep learning,” in Proceedings of the 28th International Conference on Evaluation and Assessment in Software Engineering, 2024, pp. 421–424
work page 2024
-
[15]
Inspector: Pixel-based automated game testing via exploration, detection, and investigation,
G. Liu, M. Cai, L. Zhao, T. Qin, A. Brown, J. Bischoff, and T.-Y . Liu, “Inspector: Pixel-based automated game testing via exploration, detection, and investigation,” in 2022 IEEE Conference on Games (CoG). IEEE, 2022, pp. 237–244
work page 2022
-
[16]
A. Senchenko, N. Patterson, H. Samuel, and D. Ispir, “Supernova: Automating test selection and defect prevention in aaa video games using risk based testing and machine learning,” in 2022 IEEE Conference on Software Testing, Verification and Validation (ICST). IEEE, 2022, pp. 345–354
work page 2022
-
[17]
Restful api automated test case generation,
A. Arcuri, “Restful api automated test case generation,” in 2017 IEEE International Conference on Software Quality, Reliability and Security (QRS). IEEE, 2017, pp. 9–20
work page 2017
-
[18]
Automated video game testing using synthetic and humanlike agents,
S. Ariyurek, A. Betin-Can, and E. Surer, “Automated video game testing using synthetic and humanlike agents,” IEEE Transactions on Games , vol. 13, no. 1, pp. 50–67, 2019
work page 2019
-
[19]
Y . K. Dwivedi, L. Hughes, E. Ismagilova, G. Aarts, C. Coombs, T. Crick, Y . Duan, R. Dwivedi, J. Edwards, A. Eiruget al., “Artificial intelligence (ai): Multidisciplinary perspectives on emerging challenges, opportuni- ties, and agenda for research, practice and policy,” International journal of information management , vol. 57, p. 101994, 2021
work page 2021
-
[20]
F. Cabitza, C. Fregosi, A. Campagner, and C. Natali, “Explanations con- sidered harmful: The impact of misleading explanations on accuracy in hybrid human-ai decision making,” in World Conference on Explainable Artificial Intelligence. Springer, 2024, pp. 255–269
work page 2024
-
[21]
W. Xu, M. J. Dainoff, L. Ge, and Z. Gao, “From human-computer interaction to human-ai interaction: new challenges and opportunities for enabling human-centered ai,” arXiv preprint arXiv:2105.05424 , vol. 5, 2021
-
[22]
A Survey of Hallucination in Large Foundation Models
V . Rawte, A. Sheth, and A. Das, “A survey of hallucination in large foundation models,” arXiv preprint arXiv:2309.05922 , 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[23]
Siren's Song in the AI Ocean: A Survey on Hallucination in Large Language Models
Y . Zhang, Y . Li, L. Cui, D. Cai, L. Liu, T. Fu, X. Huang, E. Zhao, Y . Zhang, Y . Chenet al., “Siren’s song in the ai ocean: a survey on hal- lucination in large language models,” arXiv preprint arXiv:2309.01219, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[24]
Contrastive learning reduces hallucination in conversations,
W. Sun, Z. Shi, S. Gao, P. Ren, M. de Rijke, and Z. Ren, “Contrastive learning reduces hallucination in conversations,” in Proceedings of the AAAI Conference on Artificial Intelligence , vol. 37, no. 11, 2023, pp. 13 618–13 626
work page 2023
-
[25]
X. Chen, L. Wang, W. Wu, Q. Tang, and Y . Liu, “Honest ai: Fine- tuning” small” language models to say” i don’t know”, and reducing hallucination in rag,” in 2024 KDD Cup Workshop for Retrieval Aug- mented Generation, 2024
work page 2024
-
[26]
S. N. Stahlke and P. Mirza-Babaei, “Usertesting without the user: Opportunities and challenges of an ai-driven approach in games user research,” Computers in Entertainment (CIE) , vol. 16, no. 2, pp. 1–18, 2018
work page 2018
-
[27]
Visual glitches classification for video game using deep learning-based techniques,
S. Jintawatsakoon, “Visual glitches classification for video game using deep learning-based techniques,” in 2023 27th International Computer Science and Engineering Conference (ICSEC) . IEEE, 2023, pp. 116– 121
work page 2023
-
[28]
Automated gameplay testing and validation with curiosity-conditioned proximal trajectories,
A. Sestini, L. Gissl ´en, J. Bergdahl, K. Tollmar, and A. D. Bagdanov, “Automated gameplay testing and validation with curiosity-conditioned proximal trajectories,” IEEE Transactions on Games , vol. 16, no. 1, pp. 113–126, 2022
work page 2022
-
[29]
Meszaros, xUnit test patterns: Refactoring test code
G. Meszaros, xUnit test patterns: Refactoring test code . Pearson Education, 2007
work page 2007
-
[30]
Gpt-4o: Openai platform documentation,
OpenAI, “Gpt-4o: Openai platform documentation,” 2024, accessed: 2024-09-30. [Online]. Available: https://platform.openai.com/docs/ models/gpt-4o
work page 2024
-
[31]
Deep residual learning for image recognition,
K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition , 2016, pp. 770–778
work page 2016
-
[32]
Imagenet classification with deep convolutional neural networks,
A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” Communications of the ACM, vol. 60, no. 6, pp. 84–90, 2017
work page 2017
-
[33]
An agent-based approach to automated game testing: an experience re- port,
I. Prasetya, F. Pastor Ric ´os, F. M. Kifetew, D. Prandi, S. Shirzadehha- jimahmood, T. E. V os, P. Paska, K. Hovorka, R. Ferdous, A. Susi et al., “An agent-based approach to automated game testing: an experience re- port,” in Proceedings of the 13th International Workshop on Automating Test Case Design, Selection and Evaluation , 2022, pp. 1–8
work page 2022
-
[34]
Glib: towards automated test oracle for graphically-rich applications,
K. Chen, Y . Li, Y . Chen, C. Fan, Z. Hu, and W. Yang, “Glib: towards automated test oracle for graphically-rich applications,” in Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2021, pp. 1093–1104
work page 2021
-
[35]
Norman, The design of everyday things: Revised and expanded edition
D. Norman, The design of everyday things: Revised and expanded edition. Basic books, 2013
work page 2013
-
[36]
B. Shneiderman, C. Plaisant, M. S. Cohen, S. Jacobs, N. Elmqvist, and N. Diakopoulos, Designing the User Interface: Strategies for Effective Human-Computer Interaction. Pearson, 2016
work page 2016
-
[37]
Dix, Human-computer interaction
A. Dix, Human-computer interaction. Pearson Education, 2004
work page 2004
-
[38]
R. Hartson and P. S. Pyla, The UX book: Agile UX design for a quality user experience. Morgan Kaufmann, 2018
work page 2018
- [39]
-
[40]
Uncalibrated models can improve human-ai collaboration,
K. V odrahalli, T. Gerstenberg, and J. Y . Zou, “Uncalibrated models can improve human-ai collaboration,” Advances in Neural Information Processing Systems, vol. 35, pp. 4004–4016, 2022
work page 2022
-
[41]
Human and machine: The impact of machine input on decision making under cognitive limitations,
T. Boyacı, C. Canyakmaz, and F. de V ´ericourt, “Human and machine: The impact of machine input on decision making under cognitive limitations,” Management Science, vol. 70, no. 2, pp. 1258–1275, 2024
work page 2024
-
[42]
A review on human–machine trust evaluation: Human-centric and machine-centric perspectives,
B. Gebru, L. Zeleke, D. Blankson, M. Nabil, S. Nateghi, A. Homai- far, and E. Tunstel, “A review on human–machine trust evaluation: Human-centric and machine-centric perspectives,” IEEE Transactions on Human-Machine Systems , vol. 52, no. 5, pp. 952–962, 2022
work page 2022
-
[43]
Dissonance between hu- man and machine understanding,
Z. Zhang, J. Singh, U. Gadiraju, and A. Anand, “Dissonance between hu- man and machine understanding,” Proceedings of the ACM on Human- Computer Interaction, vol. 3, no. CSCW, pp. 1–23, 2019
work page 2019
-
[44]
Automated usability evaluation of virtual reality applica- tions,
P. Harms, “Automated usability evaluation of virtual reality applica- tions,” ACM Transactions on Computer-Human Interaction (TOCHI) , vol. 26, no. 3, pp. 1–36, 2019
work page 2019
-
[45]
Automated bug finding in video games: A case study for runtime monitoring,
S. Varvaressos, K. Lavoie, S. Gaboury, and S. Hall ´e, “Automated bug finding in video games: A case study for runtime monitoring,” Computers in Entertainment (CIE) , vol. 15, no. 1, pp. 1–28, 2017
work page 2017
-
[46]
A deep reinforce- ment learning technique for bug detection in video games,
G. Rani, U. Pandey, A. A. Wagde, and V . S. Dhaka, “A deep reinforce- ment learning technique for bug detection in video games,” International Journal of Information Technology , vol. 15, no. 1, pp. 355–367, 2023
work page 2023
-
[47]
Video game automated testing ap- proaches: An assessment framework,
A. Albaghajati and M. Ahmed, “Video game automated testing ap- proaches: An assessment framework,” IEEE transactions on games , vol. 15, no. 1, pp. 81–94, 2020
work page 2020
-
[48]
A framework for the semi-automatic testing of video games,
A. Nantes, R. Brown, and F. Maire, “A framework for the semi-automatic testing of video games,” in Proceedings of the AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment , vol. 4, no. 1, 2008, pp. 197–202
work page 2008
-
[49]
Rivergame-a game testing tool using artificial intelligence,
C. Paduraru, M. Paduraru, and A. Stefanescu, “Rivergame-a game testing tool using artificial intelligence,” in 2022 IEEE Conference on Software Testing, Verification and Validation (ICST). IEEE, 2022, pp. 422–432
work page 2022
-
[50]
N. Ballou, A. Denisova, R. Ryan, C. S. Rigby, and S. Deterding, “The basic needs in games scale (bangs): A new tool for investigating positive and negative video game experiences,” International Journal of Human- Computer Studies, vol. 188, p. 103289, 2024
work page 2024
-
[51]
C. P. Schultz and R. D. Bryant, Game testing: All in one . Mercury Learning and Information, 2016
work page 2016
-
[52]
Know your bugs: A survey of issues in automated game testing literature,
R. Coppola, T. Fulcini, and F. Strada, “Know your bugs: A survey of issues in automated game testing literature,” in 2024 IEEE Gaming, Entertainment, and Media Conference (GEM) . IEEE, 2024, pp. 1–6
work page 2024
-
[53]
Droidgamer: Android game testing with operable widget recognition by deep learning,
B. Jiang, W. Wei, L. Yi, and W. Chan, “Droidgamer: Android game testing with operable widget recognition by deep learning,” in 2021 IEEE 21st International Conference on Software Quality, Reliability and Security (QRS). IEEE, 2021, pp. 197–206
work page 2021
-
[54]
Andrejevic, Infoglut: How too much information is changing the way we think and know
M. Andrejevic, Infoglut: How too much information is changing the way we think and know . Routledge, 2013
work page 2013
-
[55]
Docbench: A benchmark for evaluating llm-based document reading systems,
A. Zou, W. Yu, H. Zhang, K. Ma, D. Cai, Z. Zhang, H. Zhao, and D. Yu, “Docbench: A benchmark for evaluating llm-based document reading systems,” arXiv preprint arXiv:2407.10701 , 2024
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.