TempGlitch: Evaluating Vision-Language Models for Temporal Glitch Detection in Gameplay Videos

Adri\'an Barahona-R\'ios; Ashley Wiens; Benedict Wilkins; Cor-Paul Bezemer; Nabajeet Barman; Saman Zadtootaghaj; Yakun Yu

arxiv: 2605.21443 · v1 · pith:LM4CGS7Dnew · submitted 2026-05-20 · 💻 cs.CV · cs.AI

TempGlitch: Evaluating Vision-Language Models for Temporal Glitch Detection in Gameplay Videos

Yakun Yu , Ashley Wiens , Adri\'an Barahona-R\'ios , Benedict Wilkins , Saman Zadtootaghaj , Nabajeet Barman , Cor-Paul Bezemer This is my paper

Pith reviewed 2026-05-21 04:58 UTC · model grok-4.3

classification 💻 cs.CV cs.AI

keywords vision-language modelstemporal glitch detectiongameplay videosvideo benchmarkquality assurancetemporal reasoning

0 comments

The pith

Vision-language models perform near chance when asked to spot temporal glitches that only appear across frames in gameplay videos.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that existing vision-language models handle glitches visible in one frame far better than those that require watching changes over time. It creates TempGlitch, a benchmark of gameplay videos containing five specific temporal glitch types along with matched clean videos, to measure this gap directly. Across twelve models and different frame-sampling rates, performance stays close to random, with models either missing most real glitches or incorrectly labeling clean footage. A reader would care because game studios and video platforms increasingly rely on these models for automated quality checks, yet the results indicate that current approaches cannot reliably catch time-dependent errors without additional human review.

Core claim

We introduce TempGlitch, a controlled benchmark of gameplay videos that includes five temporal glitch categories with balanced samples and paired glitch-free videos for binary evaluation. Testing twelve proprietary and open-weight vision-language models across multiple frame-sampling settings shows performance near chance level, with models collapsing into either overly conservative or overly sensitive responses. Neither denser frame sampling nor larger model size reliably corrects these failures.

What carries the argument

The TempGlitch benchmark, which supplies ordered video frames with temporal glitches and matched clean counterparts to enable precise measurement of vision-language model performance on change-based anomalies.

If this is right

Vision-language models need explicit mechanisms for tracking changes across ordered frames rather than treating video as independent images.
Automated gameplay quality assurance systems that use these models will likely require human oversight specifically for temporal issues.
Future training of video models should include more tasks focused on detecting anomalies that unfold over time.
Evaluations limited to single-frame anomalies will continue to overestimate model readiness for full video understanding.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The persistent failure even with more frames suggests that simply adding temporal context is not enough without changes to how models integrate frame sequences.
The benchmark approach could be adapted to test temporal reasoning in other video domains such as surveillance or sports analysis.
Developers might explore combining vision-language models with separate motion-tracking components to catch glitches that pure visual models miss.

Load-bearing premise

The five chosen temporal glitch types represent the main kinds of time-based errors that appear in real gameplay videos, and the paired clean videos contain no hidden temporal anomalies.

What would settle it

If a new vision-language model or sampling method achieves accuracy well above the reported near-chance levels on the same TempGlitch videos while human raters confirm the labels, the claim that current models cannot handle temporal glitches would be weakened.

Figures

Figures reproduced from arXiv: 2605.21443 by Adri\'an Barahona-R\'ios, Ashley Wiens, Benedict Wilkins, Cor-Paul Bezemer, Nabajeet Barman, Saman Zadtootaghaj, Yakun Yu.

**Figure 1.** Figure 1: Spatial versus temporal glitches. The top part shows temporal glitches represented by several frames from one video: Blinking glitch refers to an object intermittently appearing and disappearing across frames; Shooting glitch indicates an artefact when the firing effect occurs at an incorrect position relative to the weapon and target; Velocity glitch are implausible changes in character motion velocity, w… view at source ↗

**Figure 2.** Figure 2: Prompt template used for constructing the spatial–temporal glitch subsets. [PITH_FULL_IMAGE:figures/full_fig_p014_2.png] view at source ↗

**Figure 3.** Figure 3: Prompt template for video-based glitch detection. [PITH_FULL_IMAGE:figures/full_fig_p014_3.png] view at source ↗

**Figure 4.** Figure 4: Examples of Blinking glitches. E Additional Performance Metrics [PITH_FULL_IMAGE:figures/full_fig_p015_4.png] view at source ↗

**Figure 5.** Figure 5: Examples of Shooting glitches [PITH_FULL_IMAGE:figures/full_fig_p016_5.png] view at source ↗

**Figure 6.** Figure 6: Examples of velocity glitches. Case 1 Case 2 Case 3 Case 4 [PITH_FULL_IMAGE:figures/full_fig_p017_6.png] view at source ↗

**Figure 7.** Figure 7: Examples of frozen glitches. 17 [PITH_FULL_IMAGE:figures/full_fig_p017_7.png] view at source ↗

**Figure 8.** Figure 8: Examples of stuck-in-place glitches. 1 FPS 5 FPS [PITH_FULL_IMAGE:figures/full_fig_p018_8.png] view at source ↗

**Figure 9.** Figure 9: Frame sampling rate affects the visibility of blinking. A brief blinking glitch occurs as the [PITH_FULL_IMAGE:figures/full_fig_p018_9.png] view at source ↗

read the original abstract

Vision-language models (VLMs) are increasingly being explored for video game quality assurance, especially gameplay glitch detection. Most existing evaluations, however, treat glitches as static visual anomalies, asking models to detect failures from a single frame. We argue that this framing misses a key distinction: some glitches are spatial and visible in an isolated frame, whereas others are temporal and become evident only through changes across ordered frames. A preliminary study confirms this gap, showing that temporal glitches are substantially harder for VLMs to detect than spatial ones. To enable systematic evaluation of this underexplored setting, we introduce TempGlitch, a controlled gameplay video benchmark for temporal glitch detection. TempGlitch covers five temporal glitch types with balanced per-category samples, together with paired glitch-free videos that enable reliable binary evaluation. We evaluate 12 proprietary and open-weight VLMs across multiple frame-sampling settings. Our results show that current VLMs remain near chance on TempGlitch, often collapsing into either overly conservative behavior that misses most glitches or overly sensitive behavior that flags clean videos as glitchy. Moreover, denser frame sampling and larger model size do not reliably resolve these failures. TempGlitch provides a focused testbed for temporal reasoning, robust gameplay understanding, and automated glitch detection with VLMs. Code and data are available at the project website.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

TempGlitch gives a new controlled benchmark for temporal glitches in game videos and shows VLMs near chance, but the results hinge on unshown details of how the data was built.

read the letter

The main point is that this paper introduces TempGlitch, a benchmark with five temporal glitch types and paired clean videos, then shows that 12 VLMs sit near chance on detecting them. Denser frame sampling and larger models do not fix the problem in any reliable way. Models tend to either miss most glitches or flag clean videos instead. That pattern is the core empirical finding. The work does a clean job separating temporal from spatial glitches and giving a focused testbed where prior evaluations mostly stayed with single-frame checks. Evaluating multiple models across sampling settings adds practical value, and releasing code and data makes it usable for follow-up work. The setup is straightforward and the negative result on scaling is worth noting for anyone working on video models for games. The soft spot is the lack of visible detail on benchmark construction. The claim that performance reflects VLM limits rather than data artifacts depends on the glitches being inserted in ways that match real gameplay issues and on the clean videos truly containing no temporal problems. The abstract mentions a controlled setup with balanced samples but does not describe the insertion mechanics or validation steps. If those choices introduce unnatural patterns, the conservative or sensitive collapse could be an artifact of the test rather than a general limit. Exact metrics, error bars, and the preliminary study results are also not shown here, which makes it harder to judge how solid the numbers are. This paper is aimed at people doing VLM evaluation for video or automated game QA. A reader building temporal reasoning tests would get a ready-made benchmark and a clear baseline to beat. It has enough new material and a broad enough evaluation to deserve peer review. Referees could usefully press on the data creation process and ask for fuller reporting of the results.

Referee Report

2 major / 2 minor

Summary. The paper introduces TempGlitch, a controlled benchmark for evaluating vision-language models on temporal glitch detection in gameplay videos. It defines five temporal glitch types with balanced per-category samples and paired glitch-free videos for binary evaluation. Across 12 proprietary and open-weight VLMs and multiple frame-sampling settings, the results indicate near-chance performance, with models often exhibiting overly conservative (missing glitches) or overly sensitive (false positives on clean videos) behavior; denser sampling and larger model size do not reliably improve outcomes. Code and data are released.

Significance. If the benchmark is shown to be valid, the work usefully demonstrates limitations in current VLMs for temporal reasoning in dynamic video settings, with direct relevance to automated gameplay QA. The distinction between spatial and temporal glitches, the empirical scale (12 models), and the public release of code/data are strengths that support reproducibility and further work in video understanding.

major comments (2)

[§3 (Benchmark Construction)] §3 (Benchmark Construction): The manuscript describes a 'controlled' benchmark with 'paired glitch-free videos' and 'balanced per-category samples' but provides no details on glitch insertion mechanics (synthetic frame drops vs. game-engine state changes) or the validation protocol confirming that clean videos contain no undetected temporal anomalies. This is load-bearing for the central claim that near-chance results and conservative/sensitive collapse reflect intrinsic VLM limits rather than benchmark artifacts.
[§4 (Experiments)] §4 (Experiments): The headline finding of 'near chance' performance and lack of benefit from denser sampling or scale is reported without exact per-model/per-category metrics, error bars, or statistical tests in the abstract and summary results. This weakens assessment of whether the conservative/sensitive behaviors are systematic or how strongly the scaling claim holds.

minor comments (2)

[Abstract] Abstract: Quantify 'near chance' with specific accuracy or F1 values (and ranges across models) to make the performance claim more precise.
[Figures] Figures: Ensure performance plots include clear legends, error bars where applicable, and direct comparison to chance level for readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment below and indicate the revisions planned for the next version of the manuscript.

read point-by-point responses

Referee: [§3 (Benchmark Construction)] The manuscript describes a 'controlled' benchmark with 'paired glitch-free videos' and 'balanced per-category samples' but provides no details on glitch insertion mechanics (synthetic frame drops vs. game-engine state changes) or the validation protocol confirming that clean videos contain no undetected temporal anomalies. This is load-bearing for the central claim that near-chance results and conservative/sensitive collapse reflect intrinsic VLM limits rather than benchmark artifacts.

Authors: We agree that greater transparency on benchmark construction is warranted. In the revised manuscript we will expand §3 with a dedicated subsection describing the glitch insertion process (post-processing frame drops, reordering, and duplication applied to gameplay recordings) and the validation protocol for the paired clean videos, which combined automated temporal-consistency checks with independent manual review by two annotators to confirm the absence of undetected anomalies. These additions will make explicit that the reported model behaviors cannot be attributed to benchmark artifacts. revision: yes
Referee: [§4 (Experiments)] The headline finding of 'near chance' performance and lack of benefit from denser sampling or scale is reported without exact per-model/per-category metrics, error bars, or statistical tests in the abstract and summary results. This weakens assessment of whether the conservative/sensitive behaviors are systematic or how strongly the scaling claim holds.

Authors: The detailed per-model and per-category results, including breakdowns of conservative versus sensitive error patterns, are already present in §4 and the supplementary tables. To address the concern about headline visibility, we will (i) augment the abstract with concrete accuracy ranges across the 12 models and (ii) add error bars (derived from sampling variation) together with paired statistical comparisons (Wilcoxon signed-rank tests) between sampling densities and model scales in the summary results paragraph. These changes will strengthen the presentation while preserving the original conclusions. revision: partial

Circularity Check

0 steps flagged

Empirical benchmark evaluation with no derivations or self-referential predictions

full rationale

This paper introduces TempGlitch as a controlled benchmark for evaluating VLMs on temporal glitch detection in gameplay videos and reports direct empirical results from testing 12 models across sampling settings. No mathematical derivations, equations, fitted parameters, or first-principles predictions are present; the central claims rest on measured model performance against externally generated outputs rather than any internal reduction to self-defined inputs. The five glitch types and paired clean videos are presented as a new testbed without any claimed derivation chain that could collapse by construction. Self-citations, if any, are incidental and not load-bearing for the evaluation outcomes.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on domain assumptions about what counts as a temporal glitch and the quality of the constructed dataset rather than mathematical free parameters or new invented entities.

axioms (1)

domain assumption Temporal glitches can be reliably distinguished from spatial ones and from clean video using ordered frame sequences.
This premise is invoked when defining the five glitch types and creating paired clean videos for binary evaluation.

pith-pipeline@v0.9.0 · 5798 in / 1265 out tokens · 23764 ms · 2026-05-21T04:58:30.088500+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We introduce TempGlitch, a controlled gameplay video benchmark for temporal glitch detection... five temporal glitch types with balanced per-category samples

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

35 extracted references · 35 canonical work pages · 5 internal anchors

[1]

The claude 3 model family: Opus, sonnet, haiku.Claude-3 Model Card, 1(1):4, 2024

AI Anthropic. The claude 3 model family: Opus, sonnet, haiku.Claude-3 Model Card, 1(1):4, 2024

work page 2024
[2]

Players’ perception of bugs and glitches in video games: An exploratory study.arXiv preprint arXiv:2504.15408, 2025

Jessica Backus. Players’ perception of bugs and glitches in video games: An exploratory study.arXiv preprint arXiv:2504.15408, 2025

work page arXiv 2025
[3]

Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond.eprint arXiv: 2308.12966, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[4]

Physgame: Uncovering physical commonsense violations in gameplay videos

Meng Cao, Haoran Tang, Haoze Zhao, Hangyu Guo, Jiaheng Liu, Ge Zhang, Ruyang Liu, Qiang Sun, Ian Reid, and Xiaodan Liang. Physgame: Uncovering physical commonsense violations in gameplay videos. arXiv preprint arXiv:2412.01800, 2024

work page arXiv 2024
[5]

Reveal-more: Amplifying human effort in quality assurance testing using automated exploration

Kenneth Chang, Batu Aytemiz, and Adam M Smith. Reveal-more: Amplifying human effort in quality assurance testing using automated exploration. In2019 IEEE Conference on Games (CoG), pages 1–8. IEEE, 2019

work page 2019
[6]

Jorge Chueca, Javier Verón, Jaime Font, Francisca Pérez, and Carlos Cetina. The consolidation of game software engineering: A systematic literature review of software engineering for industry-scale computer games.Information and Software Technology, 165:107330, 2024

work page 2024
[7]

Instructblip: Towards general-purpose vision-language models with instruction tuning.Advances in neural information processing systems, 36:49250–49267, 2023

Wenliang Dai, Junnan Li, Dongxu Li, Anthony Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale N Fung, and Steven Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning.Advances in neural information processing systems, 36:49250–49267, 2023

work page 2023
[8]

A comprehensive survey of vision-language models: Pretrained models, fine-tuning, prompt engineering, adapters, and benchmark datasets.Information Fusion, page 103623, 2025

Sufyan Danish, Abolghasem Sadeghi-Niaraki, Samee Ullah Khan, L Minh Dang, Lilia Tightiz, and Hyeonjoon Moon. A comprehensive survey of vision-language models: Pretrained models, fine-tuning, prompt engineering, adapters, and benchmark datasets.Information Fusion, page 103623, 2025

work page 2025
[9]

Smith, Hannaneh Hajishirzi, Ross Girshick, Ali Farhadi, and Aniruddha Kembhavi

Matt Deitke, Christopher Clark, Sangho Lee, Rohun Tripathi, Yue Yang, Jae Sung Park, Mohammadreza Salehi, Niklas Muennighoff, Kyle Lo, Luca Soldaini, Jiasen Lu, Taira Anderson, Erin Bransom, Kiana Ehsani, Huong Ngo, YenSung Chen, Ajay Patel, Mark Yatskar, Chris Callison-Burch, Andrew Head, Rose Hendrix, Favyen Bastani, Eli VanderBilt, Nathan Lambert, Yvon...

work page 2025
[10]

FFmpeg Developers. FFmpeg. https://ffmpeg.org/, 2026

work page 2026
[11]

Vision-language pre- training: Basics, recent advances, and future trends.F oundations and Trends® in Computer Graphics and Vision, 14(3–4):163–352, 2022

Zhe Gan, Linjie Li, Chunyuan Li, Lijuan Wang, Zicheng Liu, and Jianfeng Gao. Vision-language pre- training: Basics, recent advances, and future trends.F oundations and Trends® in Computer Graphics and Vision, 14(3–4):163–352, 2022

work page 2022
[12]

Asset pipeline patterns: patterns in interactive real-time visualization workflow

James Lear, Simon Scarle, and Richard McClatchey. Asset pipeline patterns: patterns in interactive real-time visualization workflow. InProceedings of the 24th European Conference on Pattern Languages of Programs, pages 1–11, 2019

work page 2019
[13]

A survey of state of the art large vision language models: Benchmark evaluations and challenges

Zongxia Li, Xiyang Wu, Hongyang Du, Fuxiao Liu, Huy Nghiem, and Guangyao Shi. A survey of state of the art large vision language models: Benchmark evaluations and challenges. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 1587–1606, 2025

work page 2025
[14]

Video-llava: Learning united visual representation by alignment before projection

Bin Lin, Yang Ye, Bin Zhu, Jiaxi Cui, Munan Ning, Peng Jin, and Li Yuan. Video-llava: Learning united visual representation by alignment before projection. InProceedings of the 2024 conference on empirical methods in natural language processing, pages 5971–5984, 2024

work page 2024
[15]

An empirical study of game reviews on the steam platform.Empirical Software Engineering, 24(1):170–207, 2019

Dayi Lin, Cor-Paul Bezemer, Ying Zou, and Ahmed E Hassan. An empirical study of game reviews on the steam platform.Empirical Software Engineering, 24(1):170–207, 2019

work page 2019
[16]

Ministral 3

Alexander H Liu, Kartik Khandelwal, Sandeep Subramanian, Victor Jouault, Abhinav Rastogi, Adrien Sadé, Alan Jeffares, Albert Jiang, Alexandre Cahill, Alexandre Gavaudan, et al. Ministral 3.arXiv preprint arXiv:2601.08584, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[17]

Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023. 10

work page 2023
[18]

Mmbench: Is your multi-modal model an all-around player? In European conference on computer vision, pages 216–233

Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. Mmbench: Is your multi-modal model an all-around player? In European conference on computer vision, pages 216–233. Springer, 2024

work page 2024
[19]

Automated bug frame retrieval from gameplay videos using vision-language models.arXiv preprint arXiv:2508.04895, 2025

Wentao Lu, Alexander Senchenko, Abram Hindle, and Cor-Paul Bezemer. Automated bug frame retrieval from gameplay videos using vision-language models.arXiv preprint arXiv:2508.04895, 2025

work page arXiv 2025
[20]

Towards automated video game testing: still a long way to go

Cristiano Politowski, Yann-Gaël Guéhéneuc, and Fabio Petrillo. Towards automated video game testing: still a long way to go. InProceedings of the 6th international ICSE workshop on games and software engineering: engineering fun, inspiration, and motivation, pages 37–43, 2022

work page 2022
[21]

A survey of video game testing

Cristiano Politowski, Fabio Petrillo, and Yann-Gaël Guéhéneuc. A survey of video game testing. In2021 IEEE/ACM International Conference on Automation of Software Test (AST), pages 90–99. IEEE, 2021

work page 2021
[22]

Are game engines software frameworks? a three-perspective study.Journal of Systems and Software, 171:110846, 2021

Cristiano Politowski, Fabio Petrillo, João Eduardo Montandon, Marco Tulio Valente, and Yann-Gaël Guéhéneuc. Are game engines software frameworks? a three-perspective study.Journal of Systems and Software, 171:110846, 2021

work page 2021
[23]

Vlm-robustbench: A comprehensive benchmark for robustness of vision-language models.arXiv preprint arXiv:2603.06148, 2026

Rohit Saxena, Alessandro Suglia, and Pasquale Minervini. Vlm-robustbench: A comprehensive benchmark for robustness of vision-language models.arXiv preprint arXiv:2603.06148, 2026

work page arXiv 2026
[24]

OpenAI GPT-5 System Card

Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaugh- lin, Aiden Low, AJ Ostrow, Akhila Ananthram, et al. Openai gpt-5 system card.arXiv preprint arXiv:2601.03267, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[25]

Videogamebunny: Towards vision assistants for video games

Mohammad Reza Taesiri and Cor-Paul Bezemer. Videogamebunny: Towards vision assistants for video games. In2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 1403–1413. IEEE, 2025

work page 2025
[26]

Glitchbench: Can large multimodal models detect video game glitches? InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22444–22455, 2024

Mohammad Reza Taesiri, Tianjun Feng, Cor-Paul Bezemer, and Anh Nguyen. Glitchbench: Can large multimodal models detect video game glitches? InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22444–22455, 2024

work page 2024
[27]

Videogameqa-bench: Evaluating vision-language models for video game quality assurance

Mohammad Reza Taesiri, Abhijay Ghildyal, Saman Zadtootaghaj, Nabajeet Barman, and Cor-Paul Bezemer. Videogameqa-bench: Evaluating vision-language models for video game quality assurance. InThe Thirty- ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track

work page
[28]

Clip meets gamephysics: Towards bug identification in gameplay videos using zero-shot transfer learning

Mohammad Reza Taesiri, Finlay Macklon, and Cor-Paul Bezemer. Clip meets gamephysics: Towards bug identification in gameplay videos using zero-shot transfer learning. InProceedings of the 19th International Conference on Mining Software Repositories, pages 270–281, 2022

work page 2022
[29]

Searching bug instances in gameplay video repositories.IEEE Transactions on Games, 16(3):697–710, 2024

Mohammad Reza Taesiri, Finlay Macklon, Sarra Habchi, and Cor-Paul Bezemer. Searching bug instances in gameplay video repositories.IEEE Transactions on Games, 16(3):697–710, 2024

work page 2024
[30]

Large language models are pretty good zero-shot video game bug detectors.arXiv preprint arXiv:2210.02506, 2022

Mohammad Reza Taesiri, Finlay Macklon, Yihe Wang, Hengshuo Shen, and Cor-Paul Bezemer. Large language models are pretty good zero-shot video game bug detectors.arXiv preprint arXiv:2210.02506, 2022

work page arXiv 2022
[31]

Gemini: A Family of Highly Capable Multimodal Models

Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models.arXiv preprint arXiv:2312.11805, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[32]

Gemma Team, Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ramé, Morgane Rivière, Louis Rouillard, Thomas Mesnard, Geoffrey Cideron, Jean bastien Grill, Sabela Ramos, Edouard Yvinec, Michelle Casbon, Etienne Pot, Ivo Penchev, Gaël Liu, Francesco Visin, Kathleen Kenealy, Lucas Bey...

work page 2025
[33]

CogVLM: Visual expert for pretrained language models

Weihan Wang, Qingsong Lv, Wenmeng Yu, Wenyi Hong, Ji Qi, Yan Wang, Junhui Ji, Zhuoyi Yang, Lei Zhao, Song XiXuan, Jiazheng Xu, Keqin Chen, Bin Xu, Juanzi Li, Yuxiao Dong, Ming Ding, and Jie Tang. CogVLM: Visual expert for pretrained language models. InThe Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024

work page 2024
[34]

Internvl: Scaling up vision foundation models and aligning for generic visual- linguistic tasks

Wenhai Wang, Zhe Chen, Yangzhou Liu, Yue Cao, Weiyun Wang, Xizhou Zhu, Lewei Lu, Tong Lu, Yu Qiao, and Jifeng Dai. Internvl: Scaling up vision foundation models and aligning for generic visual- linguistic tasks. InLarge Vision-Language Models: Pre-training, Prompting, and Applications, pages 23–57. Springer, 2025

work page 2025
[35]

RESP: Reference-guided Sequential Prompting for Visual Glitch Detection in Video Games

Yakun Yu, Ashley Wiens, Adrián Barahona-Ríos, Benedict Wilkins, Saman Zadtootaghaj, Nabajeet Barman, and Cor-Paul Bezemer. Resp: Reference-guided sequential prompting for visual glitch detection in video games.arXiv preprint arXiv:2604.11082, 2026. 12 A VLM Inference Details This section describes the inference providers and model-specific inference setti...

work page internal anchor Pith review Pith/arXiv arXiv 2026

[1] [1]

The claude 3 model family: Opus, sonnet, haiku.Claude-3 Model Card, 1(1):4, 2024

AI Anthropic. The claude 3 model family: Opus, sonnet, haiku.Claude-3 Model Card, 1(1):4, 2024

work page 2024

[2] [2]

Players’ perception of bugs and glitches in video games: An exploratory study.arXiv preprint arXiv:2504.15408, 2025

Jessica Backus. Players’ perception of bugs and glitches in video games: An exploratory study.arXiv preprint arXiv:2504.15408, 2025

work page arXiv 2025

[3] [3]

Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond.eprint arXiv: 2308.12966, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[4] [4]

Physgame: Uncovering physical commonsense violations in gameplay videos

Meng Cao, Haoran Tang, Haoze Zhao, Hangyu Guo, Jiaheng Liu, Ge Zhang, Ruyang Liu, Qiang Sun, Ian Reid, and Xiaodan Liang. Physgame: Uncovering physical commonsense violations in gameplay videos. arXiv preprint arXiv:2412.01800, 2024

work page arXiv 2024

[5] [5]

Reveal-more: Amplifying human effort in quality assurance testing using automated exploration

Kenneth Chang, Batu Aytemiz, and Adam M Smith. Reveal-more: Amplifying human effort in quality assurance testing using automated exploration. In2019 IEEE Conference on Games (CoG), pages 1–8. IEEE, 2019

work page 2019

[6] [6]

Jorge Chueca, Javier Verón, Jaime Font, Francisca Pérez, and Carlos Cetina. The consolidation of game software engineering: A systematic literature review of software engineering for industry-scale computer games.Information and Software Technology, 165:107330, 2024

work page 2024

[7] [7]

Instructblip: Towards general-purpose vision-language models with instruction tuning.Advances in neural information processing systems, 36:49250–49267, 2023

Wenliang Dai, Junnan Li, Dongxu Li, Anthony Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale N Fung, and Steven Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning.Advances in neural information processing systems, 36:49250–49267, 2023

work page 2023

[8] [8]

A comprehensive survey of vision-language models: Pretrained models, fine-tuning, prompt engineering, adapters, and benchmark datasets.Information Fusion, page 103623, 2025

Sufyan Danish, Abolghasem Sadeghi-Niaraki, Samee Ullah Khan, L Minh Dang, Lilia Tightiz, and Hyeonjoon Moon. A comprehensive survey of vision-language models: Pretrained models, fine-tuning, prompt engineering, adapters, and benchmark datasets.Information Fusion, page 103623, 2025

work page 2025

[9] [9]

Smith, Hannaneh Hajishirzi, Ross Girshick, Ali Farhadi, and Aniruddha Kembhavi

Matt Deitke, Christopher Clark, Sangho Lee, Rohun Tripathi, Yue Yang, Jae Sung Park, Mohammadreza Salehi, Niklas Muennighoff, Kyle Lo, Luca Soldaini, Jiasen Lu, Taira Anderson, Erin Bransom, Kiana Ehsani, Huong Ngo, YenSung Chen, Ajay Patel, Mark Yatskar, Chris Callison-Burch, Andrew Head, Rose Hendrix, Favyen Bastani, Eli VanderBilt, Nathan Lambert, Yvon...

work page 2025

[10] [10]

FFmpeg Developers. FFmpeg. https://ffmpeg.org/, 2026

work page 2026

[11] [11]

Vision-language pre- training: Basics, recent advances, and future trends.F oundations and Trends® in Computer Graphics and Vision, 14(3–4):163–352, 2022

Zhe Gan, Linjie Li, Chunyuan Li, Lijuan Wang, Zicheng Liu, and Jianfeng Gao. Vision-language pre- training: Basics, recent advances, and future trends.F oundations and Trends® in Computer Graphics and Vision, 14(3–4):163–352, 2022

work page 2022

[12] [12]

Asset pipeline patterns: patterns in interactive real-time visualization workflow

James Lear, Simon Scarle, and Richard McClatchey. Asset pipeline patterns: patterns in interactive real-time visualization workflow. InProceedings of the 24th European Conference on Pattern Languages of Programs, pages 1–11, 2019

work page 2019

[13] [13]

A survey of state of the art large vision language models: Benchmark evaluations and challenges

Zongxia Li, Xiyang Wu, Hongyang Du, Fuxiao Liu, Huy Nghiem, and Guangyao Shi. A survey of state of the art large vision language models: Benchmark evaluations and challenges. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 1587–1606, 2025

work page 2025

[14] [14]

Video-llava: Learning united visual representation by alignment before projection

Bin Lin, Yang Ye, Bin Zhu, Jiaxi Cui, Munan Ning, Peng Jin, and Li Yuan. Video-llava: Learning united visual representation by alignment before projection. InProceedings of the 2024 conference on empirical methods in natural language processing, pages 5971–5984, 2024

work page 2024

[15] [15]

An empirical study of game reviews on the steam platform.Empirical Software Engineering, 24(1):170–207, 2019

Dayi Lin, Cor-Paul Bezemer, Ying Zou, and Ahmed E Hassan. An empirical study of game reviews on the steam platform.Empirical Software Engineering, 24(1):170–207, 2019

work page 2019

[16] [16]

Ministral 3

Alexander H Liu, Kartik Khandelwal, Sandeep Subramanian, Victor Jouault, Abhinav Rastogi, Adrien Sadé, Alan Jeffares, Albert Jiang, Alexandre Cahill, Alexandre Gavaudan, et al. Ministral 3.arXiv preprint arXiv:2601.08584, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[17] [17]

Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023. 10

work page 2023

[18] [18]

Mmbench: Is your multi-modal model an all-around player? In European conference on computer vision, pages 216–233

Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. Mmbench: Is your multi-modal model an all-around player? In European conference on computer vision, pages 216–233. Springer, 2024

work page 2024

[19] [19]

Automated bug frame retrieval from gameplay videos using vision-language models.arXiv preprint arXiv:2508.04895, 2025

Wentao Lu, Alexander Senchenko, Abram Hindle, and Cor-Paul Bezemer. Automated bug frame retrieval from gameplay videos using vision-language models.arXiv preprint arXiv:2508.04895, 2025

work page arXiv 2025

[20] [20]

Towards automated video game testing: still a long way to go

Cristiano Politowski, Yann-Gaël Guéhéneuc, and Fabio Petrillo. Towards automated video game testing: still a long way to go. InProceedings of the 6th international ICSE workshop on games and software engineering: engineering fun, inspiration, and motivation, pages 37–43, 2022

work page 2022

[21] [21]

A survey of video game testing

Cristiano Politowski, Fabio Petrillo, and Yann-Gaël Guéhéneuc. A survey of video game testing. In2021 IEEE/ACM International Conference on Automation of Software Test (AST), pages 90–99. IEEE, 2021

work page 2021

[22] [22]

Are game engines software frameworks? a three-perspective study.Journal of Systems and Software, 171:110846, 2021

Cristiano Politowski, Fabio Petrillo, João Eduardo Montandon, Marco Tulio Valente, and Yann-Gaël Guéhéneuc. Are game engines software frameworks? a three-perspective study.Journal of Systems and Software, 171:110846, 2021

work page 2021

[23] [23]

Vlm-robustbench: A comprehensive benchmark for robustness of vision-language models.arXiv preprint arXiv:2603.06148, 2026

Rohit Saxena, Alessandro Suglia, and Pasquale Minervini. Vlm-robustbench: A comprehensive benchmark for robustness of vision-language models.arXiv preprint arXiv:2603.06148, 2026

work page arXiv 2026

[24] [24]

OpenAI GPT-5 System Card

Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaugh- lin, Aiden Low, AJ Ostrow, Akhila Ananthram, et al. Openai gpt-5 system card.arXiv preprint arXiv:2601.03267, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[25] [25]

Videogamebunny: Towards vision assistants for video games

Mohammad Reza Taesiri and Cor-Paul Bezemer. Videogamebunny: Towards vision assistants for video games. In2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 1403–1413. IEEE, 2025

work page 2025

[26] [26]

Glitchbench: Can large multimodal models detect video game glitches? InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22444–22455, 2024

Mohammad Reza Taesiri, Tianjun Feng, Cor-Paul Bezemer, and Anh Nguyen. Glitchbench: Can large multimodal models detect video game glitches? InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22444–22455, 2024

work page 2024

[27] [27]

Videogameqa-bench: Evaluating vision-language models for video game quality assurance

Mohammad Reza Taesiri, Abhijay Ghildyal, Saman Zadtootaghaj, Nabajeet Barman, and Cor-Paul Bezemer. Videogameqa-bench: Evaluating vision-language models for video game quality assurance. InThe Thirty- ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track

work page

[28] [28]

Clip meets gamephysics: Towards bug identification in gameplay videos using zero-shot transfer learning

Mohammad Reza Taesiri, Finlay Macklon, and Cor-Paul Bezemer. Clip meets gamephysics: Towards bug identification in gameplay videos using zero-shot transfer learning. InProceedings of the 19th International Conference on Mining Software Repositories, pages 270–281, 2022

work page 2022

[29] [29]

Searching bug instances in gameplay video repositories.IEEE Transactions on Games, 16(3):697–710, 2024

Mohammad Reza Taesiri, Finlay Macklon, Sarra Habchi, and Cor-Paul Bezemer. Searching bug instances in gameplay video repositories.IEEE Transactions on Games, 16(3):697–710, 2024

work page 2024

[30] [30]

Large language models are pretty good zero-shot video game bug detectors.arXiv preprint arXiv:2210.02506, 2022

Mohammad Reza Taesiri, Finlay Macklon, Yihe Wang, Hengshuo Shen, and Cor-Paul Bezemer. Large language models are pretty good zero-shot video game bug detectors.arXiv preprint arXiv:2210.02506, 2022

work page arXiv 2022

[31] [31]

Gemini: A Family of Highly Capable Multimodal Models

Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models.arXiv preprint arXiv:2312.11805, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[32] [32]

Gemma Team, Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ramé, Morgane Rivière, Louis Rouillard, Thomas Mesnard, Geoffrey Cideron, Jean bastien Grill, Sabela Ramos, Edouard Yvinec, Michelle Casbon, Etienne Pot, Ivo Penchev, Gaël Liu, Francesco Visin, Kathleen Kenealy, Lucas Bey...

work page 2025

[33] [33]

CogVLM: Visual expert for pretrained language models

Weihan Wang, Qingsong Lv, Wenmeng Yu, Wenyi Hong, Ji Qi, Yan Wang, Junhui Ji, Zhuoyi Yang, Lei Zhao, Song XiXuan, Jiazheng Xu, Keqin Chen, Bin Xu, Juanzi Li, Yuxiao Dong, Ming Ding, and Jie Tang. CogVLM: Visual expert for pretrained language models. InThe Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024

work page 2024

[34] [34]

Internvl: Scaling up vision foundation models and aligning for generic visual- linguistic tasks

Wenhai Wang, Zhe Chen, Yangzhou Liu, Yue Cao, Weiyun Wang, Xizhou Zhu, Lewei Lu, Tong Lu, Yu Qiao, and Jifeng Dai. Internvl: Scaling up vision foundation models and aligning for generic visual- linguistic tasks. InLarge Vision-Language Models: Pre-training, Prompting, and Applications, pages 23–57. Springer, 2025

work page 2025

[35] [35]

RESP: Reference-guided Sequential Prompting for Visual Glitch Detection in Video Games

Yakun Yu, Ashley Wiens, Adrián Barahona-Ríos, Benedict Wilkins, Saman Zadtootaghaj, Nabajeet Barman, and Cor-Paul Bezemer. Resp: Reference-guided sequential prompting for visual glitch detection in video games.arXiv preprint arXiv:2604.11082, 2026. 12 A VLM Inference Details This section describes the inference providers and model-specific inference setti...

work page internal anchor Pith review Pith/arXiv arXiv 2026