pith. machine review for the scientific record.

arxiv: 2605.13918 · v1 · submitted 2026-05-13 · 💻 cs.SE · cs.LG

Recognition: no theorem link

CA2: Code-Aware Agent for Automated Game Testing

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 05:49 UTC · model grok-4.3

classification: 💻 cs.SE · cs.LG
keywords: game testing · reinforcement learning · call stack · code coverage · automated testing · software verification · code-aware agent

The pith

Reinforcement learning agents that observe call stack traces test games more effectively than agents limited to game state alone.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that feeding call stack information to a reinforcement learning agent alongside the game state enables the agent to learn policies that reach specific target functions more reliably. This matters because manual game testing misses edge cases while existing automated approaches often fail to achieve full code coverage, leaving verification costly and incomplete. The authors instrument both state-based and image-based environments to supply efficient call stack traces and show that the resulting Code-Aware Agent (CA2) produces consistent gains over baselines that receive no code signals. The central mechanism is the addition of the current function call trace to the observation, which guides the agent toward targeted code execution rather than random exploration.
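
The mechanism is simple enough to sketch. Below is a minimal illustration of a code-aware observation, assuming a hypothetical instrumentation hook that reports which instrumented functions are currently on the call stack; the paper's actual profiler API and encoding are not specified here.

```python
# Minimal sketch of the core mechanism: augmenting the RL observation with a
# call-stack trace. The function vocabulary and the source of `call_stack` are
# assumptions; the paper's instrumentation API is not specified here.
import numpy as np

FUNCTION_VOCAB = ["spawn_enemy", "open_door", "take_damage", "save_game"]  # hypothetical

def encode_call_stack(call_stack, vocab=FUNCTION_VOCAB):
    """Multi-hot encoding of which instrumented functions appear on the stack."""
    encoding = np.zeros(len(vocab), dtype=np.float32)
    for name in call_stack:
        if name in vocab:
            encoding[vocab.index(name)] = 1.0
    return encoding

def code_aware_observation(game_state, call_stack):
    """Concatenate the usual game-state features with the call-stack encoding."""
    state = np.asarray(game_state, dtype=np.float32)
    return np.concatenate([state, encode_call_stack(call_stack)])

# The agent sees both where it is in the game and which code paths are executing.
obs = code_aware_observation(game_state=[0.2, -1.0, 3.5],
                             call_stack=["open_door", "save_game"])
print(obs.shape)  # (7,)
```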

Core claim

CA2 receives the current function call trace along with the game state and learns to reach specific target functions. In instrumented state-based and image-based environments, this code-aware agent achieves consistent improvement over non-code-aware baselines that do not leverage call stack information.

What carries the argument

The call stack trace, supplied as an additional observation signal that lets the reinforcement learning policy learn targeted paths to specific code functions.
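
A minimal sketch of what "learning targeted paths to specific code functions" amounts to in a goal-conditioned setup; the sparse success reward below is an assumption, not the paper's stated reward design.

```python
# Sketch of the goal-conditioned objective behind the core claim: the agent is
# conditioned on a target function g and succeeds once g appears on the
# instrumented call stack. The sparse reward is an assumption, not the paper's
# stated reward design.
def goal_reached(call_stack, target_function):
    """True once the target function is observed on the call stack."""
    return target_function in call_stack

def sparse_goal_reward(call_stack, target_function):
    """Assumed sparse reward: 1.0 on reaching the target, 0.0 otherwise."""
    return 1.0 if goal_reached(call_stack, target_function) else 0.0

# During a rollout the goal is fed to the policy alongside the code-aware
# observation, e.g. action = policy(observation, goal=target_function).
print(sparse_goal_reward(["open_door", "save_game"], "save_game"))  # 1.0
```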

Load-bearing premise

That supplying the call stack produces a genuine improvement in the learned testing policy rather than an artifact of the chosen games or reward design.

What would settle it

Reproducing the experiments in additional game environments where adding the call stack trace produces no measurable increase in the rate at which target functions are reached or in code coverage achieved.
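
For concreteness, the two quantities named above, target-function reach rate and code coverage, can be computed from logged episodes roughly as follows; the episode format (a set of executed function names per episode) is a hypothetical logging choice, not the paper's.

```python
# Sketch of the two metrics named above, computed from logged episodes. Each
# episode record is assumed to be the set of instrumented functions executed
# during that episode; this logging format is hypothetical.
def target_reach_rate(episodes, target_function):
    """Fraction of episodes in which the target function was executed."""
    hits = sum(1 for executed in episodes if target_function in executed)
    return hits / len(episodes)

def code_coverage(episodes, instrumented_functions):
    """Fraction of instrumented functions executed at least once overall."""
    executed = set().union(*episodes) if episodes else set()
    return len(executed & set(instrumented_functions)) / len(instrumented_functions)

episodes = [{"open_door", "take_damage"}, {"open_door"}, {"open_door", "save_game"}]
print(target_reach_rate(episodes, "save_game"))  # ~0.33
print(code_coverage(episodes, ["open_door", "take_damage", "save_game", "spawn_enemy"]))  # 0.75
```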

Figures

Figures reproduced from arXiv: 2605.13918 by David Meger, Joshua Romoff, Valliappan Chidambaram Adaikkappan, Vincent Martineau.

Figure 1
Figure 1. Presenting the Code-Aware Agent (CA2) for automated game testing. It seamlessly tests low-level game functions by leveraging call stack information: the integration of a source code profiler into our environments allows RL agents to be code-aware while interacting with the game.
Figure 2
Figure 2. Illustration of our Code-Aware Agent. All architectures use similar input formats, where a goal g, call stack s^code_t, environment state s_t, and action a_t are embedded and used as input. The code-aware MDP induces a new MDP (S × S^code, A, P), where S^code is the space of call stacks. The agent in this MDP is goal-conditioned: given a goal g ∈ G, it tries to reach a state x = (s, s^code) such that g ∈ s^code.
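
The caption's formalism, restated compactly (notation as in the caption; the sparse reward in the last line is an assumption, since the caption specifies no reward):

```latex
% Code-aware MDP from Figure 2 (notation as in the caption)
\mathcal{M}^{\mathrm{code}} = (\mathcal{S} \times \mathcal{S}^{\mathrm{code}},\; \mathcal{A},\; P),
\qquad x_t = (s_t,\; s^{\mathrm{code}}_t)

% Goal-conditioned objective: reach an augmented state whose call stack contains the goal
g \in \mathcal{G} \text{ is reached} \iff \exists\, t :\; g \in s^{\mathrm{code}}_t

% One natural sparse reward consistent with this objective (an assumption, not from the paper)
r_t(g) = \mathbb{1}\left[\, g \in s^{\mathrm{code}}_t \,\right]
```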
Original abstract

Automated game testing is important for verifying game functionality, but it remains a costly and time-consuming process. Manual testing often misses edge cases, and current automated methods struggle to provide full code coverage. Prior work has explored reinforcement learning (RL) for game testing, but without leveraging internal code signals such as the call stack. We present Code Aware Agent (CA2), which uses call stack information to learn effective testing strategies. The agent receives the current function call trace along with the game state and learns to reach specific target functions. We instrument two types of environments, 1) State-based and 2) Image-based, with support for efficient call stack extraction. Through experimental evaluation, we find that CA2 achieves consistent improvement over the non-code aware baselines, which does not leverage call stack information. Our results show that incorporating code signals like the call stack enables more effective and targeted game testing.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces CA2, a reinforcement learning agent for automated game testing that augments the observation space with call-stack traces extracted from instrumented game code. The agent is trained to reach specific target functions in two environment types (state-based and image-based) and is reported to achieve consistent improvement over non-code-aware baselines that lack access to call-stack signals.

Significance. If the experimental claims are substantiated with quantitative metrics and controls, the work would demonstrate a practical way to inject internal program state into RL-based testing agents, potentially raising code coverage and reducing manual effort in game verification pipelines.

major comments (3)
  1. [Abstract and §4] Abstract and §4 (Experimental Evaluation): the claim of 'consistent improvement' is stated without any numerical metrics, confidence intervals, statistical tests, or environment specifications (e.g., number of games, episodes, or coverage percentages), rendering the central empirical claim unverifiable from the provided text.
  2. [§3 and §4] §3 (Method) and §4: no ablation is described that removes the call-stack channel while holding reward function, instrumentation, and observation dimensionality constant; without this control it is impossible to rule out that measured gains arise from reward shaping or instrumentation artifacts rather than genuine policy conditioning on call-stack features.
  3. [§3.2] §3.2 (Environment Instrumentation): the paper asserts 'efficient call stack extraction' but supplies no runtime overhead measurements, extraction latency figures, or scaling behavior with call depth, leaving the practicality claim unsupported.
minor comments (2)
  1. [§3.1] Notation for the call-trace representation (e.g., whether it is a sequence, set, or embedding) is introduced without a formal definition or example in §3.1.
  2. [Abstract] The abstract and introduction use 'consistent improvement' without defining the metric (coverage, time-to-target, or reward) against which improvement is measured.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments and the opportunity to clarify and strengthen our manuscript. We address each major point below and will incorporate revisions to improve verifiability and rigor.

Point-by-point responses
  1. Referee: [Abstract and §4] Abstract and §4 (Experimental Evaluation): the claim of 'consistent improvement' is stated without any numerical metrics, confidence intervals, statistical tests, or environment specifications (e.g., number of games, episodes, or coverage percentages), rendering the central empirical claim unverifiable from the provided text.

    Authors: We agree that the abstract and §4 would benefit from explicit quantitative details to make the central claims verifiable. The experiments in §4 were conducted on two environment types with multiple target functions, reporting coverage percentages, success rates, and episode counts for CA2 versus baselines. In the revision we will update the abstract to include specific metrics (e.g., average coverage improvement and standard deviation across runs) and add confidence intervals plus statistical tests (paired t-tests) to the results tables and text in §4. revision: yes

  2. Referee: [§3 and §4] §3 (Method) and §4: no ablation is described that removes the call-stack channel while holding reward function, instrumentation, and observation dimensionality constant; without this control it is impossible to rule out that measured gains arise from reward shaping or instrumentation artifacts rather than genuine policy conditioning on call-stack features.

    Authors: The non-code-aware baselines already share the identical reward function and instrumentation; they differ solely by the absence of the call-stack input. To additionally control for observation dimensionality, we will add an explicit ablation in the revised §4 in which the call-stack channel is replaced by a dummy vector of identical dimension (zero-filled or random noise) while keeping all other components fixed. This will isolate the contribution of the actual call-stack features. revision: yes

  3. Referee: [§3.2] §3.2 (Environment Instrumentation): the paper asserts 'efficient call stack extraction' but supplies no runtime overhead measurements, extraction latency figures, or scaling behavior with call depth, leaving the practicality claim unsupported.

    Authors: We acknowledge that no quantitative overhead figures were supplied. During our experiments we recorded extraction latencies; in the revision we will add a short analysis (new paragraph or table in §3.2) reporting average per-step extraction time, overhead as a percentage of simulation time, and scaling across call depths up to the maximum observed in the test games. revision: yes
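
A minimal sketch of the paired comparison promised in response 1, assuming per-seed coverage numbers are available; the values are placeholders, not results from the paper.

```python
# Sketch of the paired comparison promised in response 1: paired t-test plus a
# 95% confidence interval on the per-seed coverage improvement of CA2 over the
# baseline. All numbers are placeholders, not results from the paper.
import numpy as np
from scipy import stats

ca2_coverage      = np.array([0.81, 0.78, 0.84, 0.80, 0.83])  # hypothetical per-seed coverage
baseline_coverage = np.array([0.70, 0.69, 0.74, 0.71, 0.72])

t_stat, p_value = stats.ttest_rel(ca2_coverage, baseline_coverage)  # paired t-test

diff = ca2_coverage - baseline_coverage
ci_low, ci_high = stats.t.interval(0.95, df=len(diff) - 1,
                                   loc=diff.mean(), scale=stats.sem(diff))

print(f"paired t = {t_stat:.2f}, p = {p_value:.4f}, "
      f"95% CI on coverage gain = [{ci_low:.3f}, {ci_high:.3f}]")
```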
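A minimal sketch of the dimensionality-matched ablation described in response 2, in which the call-stack channel is replaced by a zero-filled or noise vector of identical dimension while everything else stays fixed; the names and sizes are hypothetical.

```python
# Sketch of the dimensionality-matched ablation from response 2: the control
# agent sees a dummy vector in place of the call-stack encoding, so observation
# shape, reward, and instrumentation stay identical. Names and sizes are
# hypothetical.
import numpy as np

CALL_STACK_DIM = 4  # must match the dimension of the real call-stack encoding

def ablated_call_stack_channel(mode="zeros", rng=None):
    """Dummy replacement for the call-stack encoding in the control condition."""
    if mode == "zeros":
        return np.zeros(CALL_STACK_DIM, dtype=np.float32)
    if mode == "noise":
        if rng is None:
            rng = np.random.default_rng(0)
        return rng.standard_normal(CALL_STACK_DIM).astype(np.float32)
    raise ValueError(f"unknown ablation mode: {mode}")

def build_observation(game_state, call_stack_channel):
    """Identical concatenation for CA2 and for the ablated baseline."""
    return np.concatenate([np.asarray(game_state, dtype=np.float32),
                           call_stack_channel])

# CA2 would pass the real encoding here; the ablation swaps in the dummy channel.
obs_ablated = build_observation([0.2, -1.0, 3.5], ablated_call_stack_channel("noise"))
print(obs_ablated.shape)  # (7,)
```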
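A minimal sketch of the per-step overhead measurement promised in response 3, using Python's own stack inspection as a stand-in for the paper's engine-side profiler, which is not specified here.

```python
# Rough per-step timing of call-stack extraction, standing in for the overhead
# numbers promised in response 3. traceback.extract_stack is only an
# illustrative extraction mechanism; the paper's engine-side profiler differs.
import time
import traceback

def extract_call_stack():
    """Names of the functions currently on the (Python) call stack."""
    return [frame.name for frame in traceback.extract_stack()]

def extraction_overhead_us(num_steps=10_000):
    """Average extraction latency per step, in microseconds."""
    start = time.perf_counter()
    for _ in range(num_steps):
        extract_call_stack()
    elapsed = time.perf_counter() - start
    return 1e6 * elapsed / num_steps

print(f"avg extraction latency: {extraction_overhead_us():.1f} µs/step")
```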

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper presents an empirical RL agent (CA2) that augments game-state observations with call-stack traces and compares performance against non-code-aware baselines across instrumented environments. No equations, parameter-fitting steps, or derivation chains are described that would reduce any claimed result to its inputs by construction. The central claim rests on reported experimental improvements rather than any self-definitional, fitted-input, or self-citation reduction; the evaluation is therefore self-contained against external benchmarks and receives a score of 0.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are described in the abstract.

pith-pipeline@v0.9.0 · 5459 in / 954 out tokens · 31354 ms · 2026-05-15T05:49:50.226940+00:00 · methodology

discussion (0)

