
arxiv: 2604.02648 · v1 · submitted 2026-04-03 · 💻 cs.SE · cs.AI

GBQA: A Game Benchmark for Evaluating LLMs as Quality Assurance Engineers

Pith reviewed 2026-05-13 20:35 UTC · model grok-4.3

classification 💻 cs.SE cs.AI
keywords game benchmark · LLM evaluation · bug discovery · quality assurance · autonomous software engineering · ReAct agent · software testing · runtime exploration

The pith

Even the best frontier LLM detects only 48.39 percent of the verified bugs in a benchmark of 30 games built to test autonomous quality assurance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents GBQA, a benchmark of 30 games containing 124 human-verified bugs at three difficulty levels, to measure whether large language models can act as independent quality-assurance engineers by finding runtime bugs. It builds the games and injects the bugs through a scalable multi-agent workflow, then routes the output through human experts for verification. A baseline agent uses repeated ReAct reasoning steps plus memory to explore each game over long horizons. Experiments across frontier models show that even the strongest performer, Claude-4.6-Opus in thinking mode, locates fewer than half the confirmed bugs. This gap matters because bug discovery in dynamic environments has lagged far behind LLMs' code-generation performance, and a reliable testbed is needed to track progress toward autonomous software engineering.

Core claim

Autonomous bug discovery remains highly challenging for current LLMs. The GBQA benchmark supplies 30 games and 124 human-verified bugs constructed by a multi-agent system that develops the games and injects the bugs, followed by expert verification. A baseline interactive agent equipped with a multi-round ReAct loop and memory mechanism allows models to perform long-horizon exploration of the game environments. Across tested frontier models the highest score is 48.39 percent of the verified bugs, achieved by Claude-4.6-Opus in thinking mode.
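As a quick sanity check on the headline number, the sketch below back-computes how many bugs a 48.39 percent score would correspond to. It assumes the score is a plain ratio of bugs found to all 124 verified bugs in a single evaluation; the source does not state the raw count, and the names `TOTAL_BUGS`, `REPORTED_RATE`, and `detection_rate` are illustrative only.

```python
# Assumption: the reported score is found / total over all 124 verified bugs.
TOTAL_BUGS = 124          # from the paper
REPORTED_RATE = 48.39     # percent, from the paper


def detection_rate(found: int, total: int) -> float:
    """Return the percentage of verified bugs that were found."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * found / total


# Nearest whole number of bugs implied by the reported percentage.
implied_found = round(TOTAL_BUGS * REPORTED_RATE / 100)

print(implied_found)
print(round(detection_rate(implied_found, TOTAL_BUGS), 2))
```

Under that assumption the figure is consistent with 60 of 124 bugs, which rounds back to 48.39 percent.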

What carries the argument

The GBQA benchmark itself: 30 games and 124 human-verified bugs at three difficulty levels, generated scalably by a multi-agent development-and-injection pipeline and paired with a ReAct-plus-memory interactive agent for long-horizon runtime exploration.
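To make the ReAct-plus-memory idea concrete, here is a minimal sketch of such a loop. It is not the authors' agent: `ToyGame`, `Memory`, `scripted_policy`, and `run_session` are hypothetical names, the environment is a toy with one planted visibility bug, and a hard-coded policy stands in for the LLM call.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Memory:
    """Hypothetical memory: an append-only list of notes with recency recall."""
    notes: list = field(default_factory=list)

    def store(self, note: str) -> None:
        self.notes.append(note)

    def recall(self, k: int = 3) -> list:
        return self.notes[-k:]


class ToyGame:
    """Toy text environment with one planted bug: `look` reveals the key
    even though the drawer has never been opened."""

    def __init__(self) -> None:
        self.drawer_open = False

    def step(self, action: str) -> str:
        if action == "look":
            return "A bedroom with a bedside drawer. A small key is inside the drawer."
        if action == "open drawer":
            self.drawer_open = True
            return "The drawer is open."
        return "Nothing happens."


def scripted_policy(observation: str, recalled: list):
    """Stand-in for the LLM: returns (thought, action, optional bug report)."""
    drawer_opened = any("action=open drawer" in note for note in recalled)
    if "key" in observation and not drawer_opened:
        return ("The key is visible before the drawer was opened.", "stop",
                "Hidden item is described before its container is opened")
    return ("Inspect the room first.", "look", None)


def run_session(env: ToyGame, policy: Callable, memory: Memory,
                max_steps: int = 5) -> list:
    """Reason, act, observe, and remember for up to `max_steps` rounds."""
    reports = []
    observation = "You are in a bedroom."
    for _ in range(max_steps):
        thought, action, report = policy(observation, memory.recall())
        memory.store(f"thought={thought}; action={action}")
        if report is not None:
            reports.append(report)
        if action == "stop":
            break
        observation = env.step(action)
    return reports


memory = Memory()
reports = run_session(ToyGame(), scripted_policy, memory)
print(reports)
```

In a real agent the policy would be a model call and the memory would need a retrieval policy suited to long sessions; the sketch only shows how reasoning, action, observation, and stored notes feed one another.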

If this is right

  • Further gains on the GBQA benchmark would directly reduce the performance gap between LLM code generation and autonomous bug detection.
  • Models that improve on GBQA would demonstrate better long-horizon interaction with dynamic runtime environments.
  • The multi-agent construction method supplies a reusable way to create additional verified bug sets for other software domains.
  • Thinking modes and memory mechanisms measurably help but remain insufficient for reliable autonomous QA.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If GBQA generalizes, specialized agent architectures focused on runtime state tracking rather than static code review will be required for practical autonomous QA.
  • The same construction pipeline could be adapted to produce benchmarks for non-game software where runtime bugs are equally hard to surface.
  • Low detection rates suggest that current LLMs may need explicit training signals on exploration failures before they can serve as dependable QA engineers.

Load-bearing premise

The bugs inserted by the multi-agent system behave like the real-world bugs that human developers would actually encounter during game development.

What would settle it

A follow-up experiment in which the same LLMs are run on a collection of independently written open-source games that contain only human-reported bugs never touched by the multi-agent injection process, and the detection rate is compared with the 48.39 percent figure.
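One way such a comparison could be scored is a two-proportion z-test on the detection counts from the two bug sets. The sketch below is a suggestion, not something the paper proposes; the function names are hypothetical, and any counts passed in for the follow-up set would have to come from an actual experiment.

```python
import math


def two_proportion_z(found_a: int, total_a: int,
                     found_b: int, total_b: int) -> float:
    """z statistic for the difference between two detection rates,
    using the pooled standard error."""
    p_a = found_a / total_a
    p_b = found_b / total_b
    pooled = (found_a + found_b) / (total_a + total_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    if se == 0:
        raise ValueError("standard error is zero; rates are degenerate")
    return (p_a - p_b) / se


def two_sided_p(z: float) -> float:
    """Two-sided p-value under the standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2))


# Identical rates on both sets give z = 0, i.e. no evidence of a difference.
z_same = two_proportion_z(60, 124, 60, 124)
print(z_same, two_sided_p(z_same))
```

A large absolute z value on real data would suggest that injected bugs and human-reported bugs are not equally hard to find, which is the question the load-bearing premise turns on.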

Figures

Figures reproduced from arXiv: 2604.02648 by Chios Chen, Shufan Jiang, Zhiyang Chen.

Figure 1: Evolution of the software development paradigm in the LLM era. (a) Traditional human …
Figure 2: Overview of GBQA. Dataset is constructed using a multi-agent game builder that generates 30 game environments with 124 implanted bugs, which are annotated and categorized into three difficulty levels (Easy, Medium, Hard) by human QA experts. During evaluation, a QA agent autonomously interacts with the game environment through ReAct loops, and produces structured bug reports. Then, a critic agent verifies …
Figure 3: Percentage of bug discovery by difficulty …
Figure 4: Ablation study of memory module. Each cluster corresponds to a session, and vertical …
Figure 5: Architectural overview of the Game Environment Builder. The Producer Agent orchestrates …
Figure 6: Distribution of game genres across the 30 games in …
Figure 7: Screenshots of representative game environments within …
Original abstract

The autonomous discovery of bugs remains a significant challenge in modern software development. Compared to code generation, the complexity of dynamic runtime environments makes bug discovery considerably harder for large language models (LLMs). In this paper, we take game development as a representative domain and introduce the Game Benchmark for Quality Assurance (GBQA), a benchmark containing 30 games and 124 human-verified bugs across three difficulty levels, to evaluate whether LLMs can autonomously detect software bugs. The benchmark is constructed using a multi-agent system that develops games and injects bugs in a scalable manner, with human experts in the loop to ensure correctness. Moreover, we provide a baseline interactive agent equipped with a multi-round ReAct loop and a memory mechanism, enabling long-horizon exploration of game environments for bug detection across different LLMs. Extensive experiments on frontier LLMs demonstrate that autonomous bug discovery remains highly challenging: the best-performing model, Claude-4.6-Opus in thinking mode, identifies only 48.39% of the verified bugs. We believe GBQA provides an adequate testbed and evaluation criterion, and that further progress on it will help close the gap in autonomous software engineering.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces GBQA, a benchmark of 30 games containing 124 human-verified bugs across three difficulty levels, constructed via a multi-agent system that develops games and injects bugs with human experts ensuring correctness. It supplies a ReAct-based interactive agent baseline with memory for long-horizon exploration and reports that frontier LLMs struggle at autonomous bug detection, with the best result (Claude-4.6-Opus in thinking mode) reaching only 48.39% of verified bugs.

Significance. If the benchmark bugs prove representative, GBQA would supply a scalable, human-verified testbed that quantifies current LLM limitations in dynamic runtime QA and supports reproducible progress on autonomous software engineering agents. The explicit human verification loop and provision of a memory-equipped baseline agent are concrete strengths that enable independent follow-up work.

major comments (2)
  1. §3 (Benchmark Construction): the multi-agent bug-injection procedure is described only at a high level with no taxonomy comparison or statistical matching to external corpora such as GitHub game issues or Unity bug reports. This is load-bearing for the central claim because the reported 48.39% detection rate is interpreted as evidence of inherent LLM difficulty; without evidence that the 124 bugs reflect real-world distributions (e.g., physics desyncs or rare-condition pathing failures), the performance gap may be an artifact of the injection process favoring isolated, easily detectable faults.
  2. §4.2 (Agent Implementation) and §5 (Results): the exact rules governing bug injection and the full specification of the ReAct memory mechanism are not provided in sufficient detail to allow independent reproduction or verification of the 48.39% figure. Human verification is stated to confirm correctness and difficulty labels, yet the absence of these implementation parameters prevents confirmation that the evaluation protocol itself is not inadvertently tuned to the benchmark.
minor comments (2)
  1. Table 1 and Figure 3: axis labels and legend entries use inconsistent abbreviations for model names and difficulty levels; standardize to the terminology introduced in §3.1.
  2. §2 (Related Work): the discussion of prior game-testing agents omits recent work on LLM-driven playtesting in Unity and Unreal environments; add two or three representative citations to situate GBQA.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address the major concerns regarding benchmark construction and reproducibility below, and will incorporate revisions to strengthen the paper.

  1. Referee: §3 (Benchmark Construction): the multi-agent bug-injection procedure is described only at a high level with no taxonomy comparison or statistical matching to external corpora such as GitHub game issues or Unity bug reports. This is load-bearing for the central claim because the reported 48.39% detection rate is interpreted as evidence of inherent LLM difficulty; without evidence that the 124 bugs reflect real-world distributions (e.g., physics desyncs or rare-condition pathing failures), the performance gap may be an artifact of the injection process favoring isolated, easily detectable faults.

    Authors: We agree that a more detailed characterization of the bug distribution would strengthen the claims of representativeness. In the revised manuscript we will expand §3 with an explicit taxonomy of the 124 bugs (logic errors, physics desyncs, pathing failures, rare-condition triggers, UI glitches) together with their counts per difficulty level. Where feasible we will also add a statistical comparison against publicly available game-bug corpora (GitHub issues, Unity forums). The human verification loop already guarantees that every bug is real and correctly labeled; the added taxonomy will make this explicit and address the concern that the 48.39% figure might be an artifact of overly simple faults. These additions will be included in the revision.

    Revision: yes

  2. Referee: §4.2 (Agent Implementation) and §5 (Results): the exact rules governing bug injection and the full specification of the ReAct memory mechanism are not provided in sufficient detail to allow independent reproduction or verification of the 48.39% figure. Human verification is stated to confirm correctness and difficulty labels, yet the absence of these implementation parameters prevents confirmation that the evaluation protocol itself is not inadvertently tuned to the benchmark.

    Authors: We acknowledge that greater implementation detail is required for reproducibility. In the revised manuscript we will supply (i) the precise rules used by the multi-agent bug-injection system (including the conditions and parameters for each bug category) and (ii) a complete specification of the ReAct memory mechanism (storage format, retrieval policy, and update logic across rounds). We will also expand the description of the human verification protocol with the exact criteria applied for correctness and difficulty assignment. The full source code for both benchmark construction and the baseline agent will be released publicly. The evaluation protocol follows the standard ReAct loop augmented with memory and was not tuned to the benchmark; the added details will allow independent confirmation of this fact.

    Revision: yes
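The taxonomy promised in the first response amounts to a category-by-difficulty count table. A minimal sketch of how such a table could be tallied follows; the category and difficulty names come from the text above, but the `records` list is placeholder data for illustration, not the benchmark's actual bug set, and `tally` and `as_table` are hypothetical helpers.

```python
from collections import Counter

CATEGORIES = ("logic error", "physics desync", "pathing failure",
              "rare-condition trigger", "UI glitch")
DIFFICULTIES = ("Easy", "Medium", "Hard")

# Placeholder records (category, difficulty); NOT data from GBQA.
records = [
    ("logic error", "Easy"),
    ("logic error", "Hard"),
    ("physics desync", "Medium"),
    ("UI glitch", "Easy"),
    ("pathing failure", "Hard"),
]


def tally(rows):
    """Count bugs per (category, difficulty), rejecting unknown labels."""
    counts = Counter()
    for category, difficulty in rows:
        if category not in CATEGORIES or difficulty not in DIFFICULTIES:
            raise ValueError(f"unknown label: {category!r}, {difficulty!r}")
        counts[(category, difficulty)] += 1
    return counts


def as_table(counts):
    """Render the counts as a fixed-width text table."""
    lines = [f"{'category':<24}" + "".join(f"{d:>8}" for d in DIFFICULTIES)]
    for c in CATEGORIES:
        cells = "".join(f"{counts[(c, d)]:>8}" for d in DIFFICULTIES)
        lines.append(f"{c:<24}" + cells)
    return "\n".join(lines)


counts = tally(records)
print(as_table(counts))
```

The same table computed over an external corpus would give the two distributions the referee asks to compare.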

Circularity Check

0 steps flagged

No circularity in derivation or evaluation chain


The paper constructs GBQA via a multi-agent game-development and bug-injection pipeline followed by human verification of 124 bugs, then measures LLM detection rates (e.g., Claude-4.6-Opus at 48.39%) directly on that fixed benchmark. No equations, fitted parameters, or self-referential definitions exist that would reduce the reported detection percentages to quantities defined by the benchmark itself. No load-bearing self-citations, uniqueness theorems, or ansatzes are invoked; the empirical result is obtained from external frontier models evaluated on the constructed testbed and remains independent of the construction process.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard assumptions about LLM agent capabilities and the representativeness of synthetically injected bugs; no free parameters or invented entities are introduced.

axioms (2)
  • Domain assumption: LLMs equipped with ReAct loops and memory can perform long-horizon exploration in game environments.
    Invoked when describing the baseline interactive agent.
  • Domain assumption: Human experts can reliably verify the correctness of bugs injected by the multi-agent system.
    Used to establish the ground-truth bug set of 124 items.

pith-pipeline@v0.9.0 · 5507 in / 1216 out tokens · 33760 ms · 2026-05-13T20:35:51.131327+00:00 · methodology




    Executelookbefore opening the bedside drawer. explanation: The room description exposes a hidden item before the relevant container has been opened. This violates the intended visibility rule of the environment and is therefore a valid bug rather than a player-strategy issue. F.8 IMPORTANTCONSIDERATIONS • Judge against intended behavior, not preference.Do...