
arxiv: 2605.21443 · v1 · pith:LM4CGS7Dnew · submitted 2026-05-20 · 💻 cs.CV · cs.AI

TempGlitch: Evaluating Vision-Language Models for Temporal Glitch Detection in Gameplay Videos

Pith reviewed 2026-05-21 04:58 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords vision-language models · temporal glitch detection · gameplay videos · video benchmark · quality assurance · temporal reasoning

The pith

Vision-language models perform near chance when asked to spot temporal glitches that only appear across frames in gameplay videos.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that existing vision-language models handle glitches visible in one frame far better than those that require watching changes over time. It creates TempGlitch, a benchmark of gameplay videos containing five specific temporal glitch types along with matched clean videos, to measure this gap directly. Across twelve models and different frame-sampling rates, performance stays close to random, with models either missing most real glitches or incorrectly labeling clean footage. A reader would care because game studios and video platforms increasingly rely on these models for automated quality checks, yet the results indicate that current approaches cannot reliably catch time-dependent errors without additional human review.

Core claim

We introduce TempGlitch, a controlled benchmark of gameplay videos that includes five temporal glitch categories with balanced samples and paired glitch-free videos for binary evaluation. Testing twelve proprietary and open-weight vision-language models across multiple frame-sampling settings shows performance near chance level, with models collapsing into either overly conservative or overly sensitive responses. Neither denser frame sampling nor larger model size reliably corrects these failures.
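To make the binary evaluation concrete, the following minimal sketch computes accuracy, glitch recall, and clean-video specificity from paired predictions, and labels the two collapse modes named in the core claim. The function name `evaluate`, the `margin` threshold of 0.2, and the toy predictions are illustrative assumptions, not values or code from the paper.

```python
from dataclasses import dataclass


@dataclass
class BinaryReport:
    accuracy: float
    recall: float       # fraction of glitch videos flagged as glitchy
    specificity: float  # fraction of clean videos passed as clean
    mode: str           # "conservative", "sensitive", or "mixed"


def evaluate(labels, preds, margin=0.2):
    """Score binary glitch predictions (1 = glitch, 0 = clean).

    `margin` is a hypothetical cutoff for calling a model collapsed.
    """
    if len(labels) != len(preds):
        raise ValueError("labels and preds must have the same length")
    pos = sum(labels)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one glitch and one clean video")

    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    tn = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 0)
    recall = tp / pos
    specificity = tn / neg
    accuracy = (tp + tn) / len(labels)

    if recall <= margin and specificity >= 1 - margin:
        mode = "conservative"  # misses most glitches
    elif recall >= 1 - margin and specificity <= margin:
        mode = "sensitive"     # flags most clean videos
    else:
        mode = "mixed"
    return BinaryReport(accuracy, recall, specificity, mode)


# Toy balanced set: five glitch videos paired with five clean ones.
labels = [1] * 5 + [0] * 5
print(evaluate(labels, [0] * 10))  # model that never reports a glitch
print(evaluate(labels, [1] * 10))  # model that always reports a glitch
```

On a balanced set, both degenerate strategies land at 50% accuracy, which is why accuracy alone cannot separate the two collapse modes and why recall and specificity are reported side by side here.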

What carries the argument

The TempGlitch benchmark, which supplies ordered video frames with temporal glitches and matched clean counterparts to enable precise measurement of vision-language model performance on change-based anomalies.

If this is right

  • Vision-language models need explicit mechanisms for tracking changes across ordered frames rather than treating video as independent images.
  • Automated gameplay quality assurance systems that use these models will likely require human oversight specifically for temporal issues.
  • Future training of video models should include more tasks focused on detecting anomalies that unfold over time.
  • Evaluations limited to single-frame anomalies will continue to overestimate model readiness for full video understanding.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The persistent failure even with more frames suggests that simply adding temporal context is not enough without changes to how models integrate frame sequences.
  • The benchmark approach could be adapted to test temporal reasoning in other video domains such as surveillance or sports analysis.
  • Developers might explore combining vision-language models with separate motion-tracking components to catch glitches that pure visual models miss.

Load-bearing premise

The five chosen temporal glitch types represent the main kinds of time-based errors that appear in real gameplay videos, and the paired clean videos contain no hidden temporal anomalies.

What would settle it

If a new vision-language model or sampling method achieves accuracy well above the reported near-chance levels on the same TempGlitch videos while human raters confirm the labels, the claim that current models cannot handle temporal glitches would be weakened.

Figures

Figures reproduced from arXiv: 2605.21443 by Adrián Barahona-Ríos, Ashley Wiens, Benedict Wilkins, Cor-Paul Bezemer, Nabajeet Barman, Saman Zadtootaghaj, Yakun Yu.

Figure 1: Spatial versus temporal glitches. The top part shows temporal glitches represented by several frames from one video: a Blinking glitch refers to an object intermittently appearing and disappearing across frames; a Shooting glitch indicates an artefact in which the firing effect occurs at an incorrect position relative to the weapon and target; Velocity glitches are implausible changes in character motion velocity, w…
Figure 2: Prompt template used for constructing the spatial–temporal glitch subsets.
Figure 3: Prompt template for video-based glitch detection.
Figure 4: Examples of Blinking glitches.
Figure 5: Examples of Shooting glitches.
Figure 6: Examples of velocity glitches.
Figure 7: Examples of frozen glitches.
Figure 8: Examples of stuck-in-place glitches.
Figure 9: Frame sampling rate affects the visibility of blinking. A brief blinking glitch occurs as the … (panel labels: 1 FPS, 5 FPS)
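
The effect named in the Figure 9 caption can be illustrated with a small sketch: subsample a clip at a fixed rate and check whether any kept frame falls inside a brief glitch interval. The 30 FPS source rate, the 90-frame clip, and the glitch timing are hypothetical values chosen for illustration; the 1 FPS and 5 FPS rates mirror the panel labels that accompany the figure.

```python
def sampled_indices(num_frames, source_fps, target_fps):
    """Indices of the frames kept when uniformly subsampling to target_fps."""
    if not 0 < target_fps <= source_fps:
        raise ValueError("target_fps must be in (0, source_fps]")
    indices = []
    i = 0
    while True:
        # Integer arithmetic avoids floating-point drift in the frame index.
        idx = (i * source_fps) // target_fps
        if idx >= num_frames:
            break
        indices.append(idx)
        i += 1
    return indices


def glitch_visible(indices, glitch_start, glitch_end):
    """True if at least one sampled frame lies in [glitch_start, glitch_end)."""
    return any(glitch_start <= i < glitch_end for i in indices)


# Hypothetical clip: 3 seconds at 30 FPS, object missing during frames 40-49.
NUM_FRAMES, SOURCE_FPS = 90, 30
GLITCH_START, GLITCH_END = 40, 50

for fps in (1, 5):
    kept = sampled_indices(NUM_FRAMES, SOURCE_FPS, fps)
    print(fps, "FPS ->", len(kept), "frames, glitch sampled:",
          glitch_visible(kept, GLITCH_START, GLITCH_END))
```

In this toy setup the 1 FPS sampling keeps frames 0, 30, and 60 and skips the glitch entirely, while 5 FPS keeps frame 42, which lies inside it. This only shows that sparse sampling can remove the evidence from the model's input; it says nothing about whether a model uses the evidence once it is present, which is the failure the paper reports.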
Original abstract

Vision-language models (VLMs) are increasingly being explored for video game quality assurance, especially gameplay glitch detection. Most existing evaluations, however, treat glitches as static visual anomalies, asking models to detect failures from a single frame. We argue that this framing misses a key distinction: some glitches are spatial and visible in an isolated frame, whereas others are temporal and become evident only through changes across ordered frames. A preliminary study confirms this gap, showing that temporal glitches are substantially harder for VLMs to detect than spatial ones. To enable systematic evaluation of this underexplored setting, we introduce TempGlitch, a controlled gameplay video benchmark for temporal glitch detection. TempGlitch covers five temporal glitch types with balanced per-category samples, together with paired glitch-free videos that enable reliable binary evaluation. We evaluate 12 proprietary and open-weight VLMs across multiple frame-sampling settings. Our results show that current VLMs remain near chance on TempGlitch, often collapsing into either overly conservative behavior that misses most glitches or overly sensitive behavior that flags clean videos as glitchy. Moreover, denser frame sampling and larger model size do not reliably resolve these failures. TempGlitch provides a focused testbed for temporal reasoning, robust gameplay understanding, and automated glitch detection with VLMs. Code and data are available at the project website.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces TempGlitch, a controlled benchmark for evaluating vision-language models on temporal glitch detection in gameplay videos. It defines five temporal glitch types with balanced per-category samples and paired glitch-free videos for binary evaluation. Across 12 proprietary and open-weight VLMs and multiple frame-sampling settings, the results indicate near-chance performance, with models often exhibiting overly conservative (missing glitches) or overly sensitive (false positives on clean videos) behavior; denser sampling and larger model size do not reliably improve outcomes. Code and data are released.

Significance. If the benchmark is shown to be valid, the work usefully demonstrates limitations in current VLMs for temporal reasoning in dynamic video settings, with direct relevance to automated gameplay QA. The distinction between spatial and temporal glitches, the empirical scale (12 models), and the public release of code/data are strengths that support reproducibility and further work in video understanding.

major comments (2)
  1. [§3 (Benchmark Construction)] The manuscript describes a 'controlled' benchmark with 'paired glitch-free videos' and 'balanced per-category samples' but provides no details on glitch insertion mechanics (synthetic frame drops vs. game-engine state changes) or the validation protocol confirming that clean videos contain no undetected temporal anomalies. This is load-bearing for the central claim that near-chance results and conservative/sensitive collapse reflect intrinsic VLM limits rather than benchmark artifacts.
  2. [§4 (Experiments)] The headline finding of 'near chance' performance and lack of benefit from denser sampling or scale is reported without exact per-model/per-category metrics, error bars, or statistical tests in the abstract and summary results. This weakens assessment of whether the conservative/sensitive behaviors are systematic or how strongly the scaling claim holds.
minor comments (2)
  1. [Abstract] Quantify 'near chance' with specific accuracy or F1 values (and ranges across models) to make the performance claim more precise.
  2. [Figures] Ensure performance plots include clear legends, error bars where applicable, and direct comparison to chance level for readability.
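
One way to act on the request to quantify 'near chance' is an exact two-sided binomial test of a model's accuracy against the 50% baseline of a balanced binary benchmark. The sketch below uses only the Python standard library; the function name and the sample counts are made up for illustration and are not results from the paper.

```python
from math import comb


def binomial_two_sided_p(correct, total, chance=0.5):
    """Exact two-sided binomial p-value against a chance-level accuracy.

    Sums the probability of every outcome that is no more likely than the
    observed count under the null hypothesis accuracy == chance.
    """
    if total <= 0 or not 0 <= correct <= total:
        raise ValueError("need 0 <= correct <= total and total > 0")
    pmf = [
        comb(total, k) * chance**k * (1 - chance) ** (total - k)
        for k in range(total + 1)
    ]
    observed = pmf[correct]
    # The small relative tolerance guards against floating-point ties.
    p_value = sum(p for p in pmf if p <= observed * (1 + 1e-9))
    return min(1.0, p_value)


# Hypothetical counts on a 100-video balanced set.
for correct in (52, 70):
    print(correct, "/ 100 correct, p =", binomial_two_sided_p(correct, 100))
```

A large p-value means the observed accuracy cannot be distinguished from guessing at that sample size, which is a more precise statement than 'near chance' on its own.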

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment below and indicate the revisions planned for the next version of the manuscript.

Point-by-point responses
  1. Referee: [§3 (Benchmark Construction)] The manuscript describes a 'controlled' benchmark with 'paired glitch-free videos' and 'balanced per-category samples' but provides no details on glitch insertion mechanics (synthetic frame drops vs. game-engine state changes) or the validation protocol confirming that clean videos contain no undetected temporal anomalies. This is load-bearing for the central claim that near-chance results and conservative/sensitive collapse reflect intrinsic VLM limits rather than benchmark artifacts.

    Authors: We agree that greater transparency on benchmark construction is warranted. In the revised manuscript we will expand §3 with a dedicated subsection describing the glitch insertion process (post-processing frame drops, reordering, and duplication applied to gameplay recordings) and the validation protocol for the paired clean videos, which combined automated temporal-consistency checks with independent manual review by two annotators to confirm the absence of undetected anomalies. These additions will make explicit that the reported model behaviors cannot be attributed to benchmark artifacts. revision: yes

  2. Referee: [§4 (Experiments)] The headline finding of 'near chance' performance and lack of benefit from denser sampling or scale is reported without exact per-model/per-category metrics, error bars, or statistical tests in the abstract and summary results. This weakens assessment of whether the conservative/sensitive behaviors are systematic or how strongly the scaling claim holds.

    Authors: The detailed per-model and per-category results, including breakdowns of conservative versus sensitive error patterns, are already present in §4 and the supplementary tables. To address the concern about headline visibility, we will (i) augment the abstract with concrete accuracy ranges across the 12 models and (ii) add error bars (derived from sampling variation) together with paired statistical comparisons (Wilcoxon signed-rank tests) between sampling densities and model scales in the summary results paragraph. These changes will strengthen the presentation while preserving the original conclusions. revision: partial
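
The response above does not specify how the error bars would be computed. As one generic possibility, and explicitly not the authors' stated procedure, a percentile bootstrap over per-video correctness flags yields a confidence interval for accuracy. The function name, resample count, seed, and toy data below are all assumptions made for illustration.

```python
import random


def bootstrap_accuracy_ci(correct_flags, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for accuracy.

    correct_flags holds one 0/1 entry per video (1 = model was right).
    Returns (point_estimate, lower_bound, upper_bound).
    """
    if not correct_flags:
        raise ValueError("correct_flags must not be empty")
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    n = len(correct_flags)
    stats = []
    for _ in range(n_resamples):
        # Resample videos with replacement and recompute accuracy.
        resample = rng.choices(correct_flags, k=n)
        stats.append(sum(resample) / n)
    stats.sort()
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return sum(correct_flags) / n, lower, upper


# Hypothetical model that is right on 52 of 100 videos.
flags = [1] * 52 + [0] * 48
print(bootstrap_accuracy_ci(flags))
```

If the resulting interval straddles 0.5, the model's accuracy on that sample is compatible with chance. Paired comparisons between sampling densities would need a separate paired test, such as the Wilcoxon signed-rank test named in the response.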

Circularity Check

0 steps flagged

Empirical benchmark evaluation with no derivations or self-referential predictions

full rationale

This paper introduces TempGlitch as a controlled benchmark for evaluating VLMs on temporal glitch detection in gameplay videos and reports direct empirical results from testing 12 models across sampling settings. No mathematical derivations, equations, fitted parameters, or first-principles predictions are present; the central claims rest on measured model performance against externally generated outputs rather than any internal reduction to self-defined inputs. The five glitch types and paired clean videos are presented as a new testbed without any claimed derivation chain that could collapse by construction. Self-citations, if any, are incidental and not load-bearing for the evaluation outcomes.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on domain assumptions about what counts as a temporal glitch and the quality of the constructed dataset rather than mathematical free parameters or new invented entities.

axioms (1)
  • domain assumption: Temporal glitches can be reliably distinguished from spatial ones and from clean video using ordered frame sequences.
    This premise is invoked when defining the five glitch types and creating paired clean videos for binary evaluation.

pith-pipeline@v0.9.0 · 5798 in / 1265 out tokens · 23764 ms · 2026-05-21T04:58:30.088500+00:00 · methodology


