pith. machine review for the scientific record.

arxiv: 2605.13918 · v1 · submitted 2026-05-13 · 💻 cs.SE · cs.LG

Recognition: no theorem link

CA2: Code-Aware Agent for Automated Game Testing

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 05:49 UTC · model grok-4.3

classification: 💻 cs.SE · cs.LG
keywords: game testing · reinforcement learning · call stack · code coverage · automated testing · software verification · code-aware agent

The pith

Reinforcement learning agents that observe call stack traces test games more effectively than agents limited to game state alone.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that feeding call stack information to a reinforcement learning agent alongside the game state enables the agent to learn policies that reach specific target functions more reliably. This matters because manual game testing misses edge cases while existing automated approaches often fail to achieve full code coverage, leaving verification costly and incomplete. The authors instrument both state-based and image-based environments to supply efficient call stack traces and show that the resulting Code-Aware Agent (CA2) produces consistent gains over baselines that receive no code signals. The central mechanism is the addition of the current function call trace to the observation, which guides the agent toward targeted code execution rather than random exploration.
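
The mechanism is simple enough to sketch. Below is a minimal illustration of a code-aware observation, assuming a hypothetical instrumentation hook that reports which instrumented functions are currently on the call stack; the paper's actual profiler API and encoding are not specified here.

```python
# Minimal sketch of the core mechanism: augmenting the RL observation with a
# call-stack trace. The function vocabulary and the source of `call_stack` are
# assumptions; the paper's instrumentation API is not specified here.
import numpy as np

FUNCTION_VOCAB = ["spawn_enemy", "open_door", "take_damage", "save_game"]  # hypothetical

def encode_call_stack(call_stack, vocab=FUNCTION_VOCAB):
    """Multi-hot encoding of which instrumented functions appear on the stack."""
    encoding = np.zeros(len(vocab), dtype=np.float32)
    for name in call_stack:
        if name in vocab:
            encoding[vocab.index(name)] = 1.0
    return encoding

def code_aware_observation(game_state, call_stack):
    """Concatenate the usual game-state features with the call-stack encoding."""
    state = np.asarray(game_state, dtype=np.float32)
    return np.concatenate([state, encode_call_stack(call_stack)])

# The agent sees both where it is in the game and which code paths are executing.
obs = code_aware_observation(game_state=[0.2, -1.0, 3.5],
                             call_stack=["open_door", "save_game"])
print(obs.shape)  # (7,)
```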

Core claim

CA2 receives the current function call trace along with the game state and learns to reach specific target functions. In instrumented state-based and image-based environments, this code-aware agent achieves consistent improvement over non-code-aware baselines that do not leverage call stack information.

What carries the argument

The call stack trace, supplied as an additional observation signal that lets the reinforcement learning policy learn targeted paths to specific code functions.
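
A minimal sketch of what "learning targeted paths to specific code functions" amounts to in a goal-conditioned setup; the sparse success reward below is an assumption, not the paper's stated reward design.

```python
# Sketch of the goal-conditioned objective behind the core claim: the agent is
# conditioned on a target function g and succeeds once g appears on the
# instrumented call stack. The sparse reward is an assumption, not the paper's
# stated reward design.
def goal_reached(call_stack, target_function):
    """True once the target function is observed on the call stack."""
    return target_function in call_stack

def sparse_goal_reward(call_stack, target_function):
    """Assumed sparse reward: 1.0 on reaching the target, 0.0 otherwise."""
    return 1.0 if goal_reached(call_stack, target_function) else 0.0

# During a rollout the goal is fed to the policy alongside the code-aware
# observation, e.g. action = policy(observation, goal=target_function).
print(sparse_goal_reward(["open_door", "save_game"], "save_game"))  # 1.0
```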

Load-bearing premise

That supplying the call stack produces a genuine improvement in the learned testing policy rather than an artifact of the chosen games or reward design.

What would settle it

Reproducing the experiments in additional game environments where adding the call stack trace produces no measurable increase in the rate at which target functions are reached or in code coverage achieved.
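
For concreteness, the two quantities named above, target-function reach rate and code coverage, can be computed from logged episodes roughly as follows; the episode format (a set of executed function names per episode) is a hypothetical logging choice, not the paper's.

```python
# Sketch of the two metrics named above, computed from logged episodes. Each
# episode record is assumed to be the set of instrumented functions executed
# during that episode; this logging format is hypothetical.
def target_reach_rate(episodes, target_function):
    """Fraction of episodes in which the target function was executed."""
    hits = sum(1 for executed in episodes if target_function in executed)
    return hits / len(episodes)

def code_coverage(episodes, instrumented_functions):
    """Fraction of instrumented functions executed at least once overall."""
    executed = set().union(*episodes) if episodes else set()
    return len(executed & set(instrumented_functions)) / len(instrumented_functions)

episodes = [{"open_door", "take_damage"}, {"open_door"}, {"open_door", "save_game"}]
print(target_reach_rate(episodes, "save_game"))  # ~0.33
print(code_coverage(episodes, ["open_door", "take_damage", "save_game", "spawn_enemy"]))  # 0.75
```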

Figures

Figures reproduced from arXiv: 2605.13918 by David Meger, Joshua Romoff, Valliappan Chidambaram Adaikkappan, Vincent Martineau.

Figure 1
Figure 1. Presenting the Code-Aware Agent (CA2) for automated game testing. It seamlessly tests low-level game functions by leveraging call stack information: the integration of a source code profiler into our environments allows RL agents to be code-aware while interacting with the game.
Figure 2
Figure 2. Illustration of our Code-Aware Agent. All architectures use similar input formats, where a goal g, call stack s^code_t, environment state s_t, and action a_t are embedded and used as input. The code-aware MDP induces a new MDP (S × S^code, A, P), where S^code is the space of call stacks. The agent in this MDP is goal-conditioned: given a goal g ∈ G, it tries to reach a state x = (s, s^code) such that g ∈ s^code.
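
The caption's formalism, restated compactly (notation as in the caption; the sparse reward in the last line is an assumption, since the caption specifies no reward):

```latex
% Code-aware MDP from Figure 2 (notation as in the caption)
\mathcal{M}^{\mathrm{code}} = (\mathcal{S} \times \mathcal{S}^{\mathrm{code}},\; \mathcal{A},\; P),
\qquad x_t = (s_t,\; s^{\mathrm{code}}_t)

% Goal-conditioned objective: reach an augmented state whose call stack contains the goal
g \in \mathcal{G} \text{ is reached} \iff \exists\, t :\; g \in s^{\mathrm{code}}_t

% One natural sparse reward consistent with this objective (an assumption, not from the paper)
r_t(g) = \mathbb{1}\left[\, g \in s^{\mathrm{code}}_t \,\right]
```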
Original abstract

Automated game testing is important for verifying game functionality, but it remains a costly and time-consuming process. Manual testing often misses edge cases, and current automated methods struggle to provide full code coverage. Prior work has explored reinforcement learning (RL) for game testing, but without leveraging internal code signals such as the call stack. We present Code Aware Agent (CA2), which uses call stack information to learn effective testing strategies. The agent receives the current function call trace along with the game state and learns to reach specific target functions. We instrument two types of environments, 1) State-based and 2) Image-based, with support for efficient call stack extraction. Through experimental evaluation, we find that CA2 achieves consistent improvement over the non-code aware baselines, which does not leverage call stack information. Our results show that incorporating code signals like the call stack enables more effective and targeted game testing.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces CA2, a reinforcement learning agent for automated game testing that augments the observation space with call-stack traces extracted from instrumented game code. The agent is trained to reach specific target functions in two environment types (state-based and image-based) and is reported to achieve consistent improvement over non-code-aware baselines that lack access to call-stack signals.

Significance. If the experimental claims are substantiated with quantitative metrics and controls, the work would demonstrate a practical way to inject internal program state into RL-based testing agents, potentially raising code coverage and reducing manual effort in game verification pipelines.

major comments (3)
  1. [Abstract and §4] Abstract and §4 (Experimental Evaluation): the claim of 'consistent improvement' is stated without any numerical metrics, confidence intervals, statistical tests, or environment specifications (e.g., number of games, episodes, or coverage percentages), rendering the central empirical claim unverifiable from the provided text.
  2. [§3 and §4] §3 (Method) and §4: no ablation is described that removes the call-stack channel while holding reward function, instrumentation, and observation dimensionality constant; without this control it is impossible to rule out that measured gains arise from reward shaping or instrumentation artifacts rather than genuine policy conditioning on call-stack features.
  3. [§3.2] §3.2 (Environment Instrumentation): the paper asserts 'efficient call stack extraction' but supplies no runtime overhead measurements, extraction latency figures, or scaling behavior with call depth, leaving the practicality claim unsupported.
minor comments (2)
  1. [§3.1] Notation for the call-trace representation (e.g., whether it is a sequence, set, or embedding) is introduced without a formal definition or example in §3.1.
  2. [Abstract] The abstract and introduction use 'consistent improvement' without defining the metric (coverage, time-to-target, or reward) against which improvement is measured.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments and the opportunity to clarify and strengthen our manuscript. We address each major point below and will incorporate revisions to improve verifiability and rigor.

Point-by-point responses
  1. Referee: [Abstract and §4] Abstract and §4 (Experimental Evaluation): the claim of 'consistent improvement' is stated without any numerical metrics, confidence intervals, statistical tests, or environment specifications (e.g., number of games, episodes, or coverage percentages), rendering the central empirical claim unverifiable from the provided text.

    Authors: We agree that the abstract and §4 would benefit from explicit quantitative details to make the central claims verifiable. The experiments in §4 were conducted on two environment types with multiple target functions, reporting coverage percentages, success rates, and episode counts for CA2 versus baselines. In the revision we will update the abstract to include specific metrics (e.g., average coverage improvement and standard deviation across runs) and add confidence intervals plus statistical tests (paired t-tests) to the results tables and text in §4. revision: yes

  2. Referee: [§3 and §4] §3 (Method) and §4: no ablation is described that removes the call-stack channel while holding reward function, instrumentation, and observation dimensionality constant; without this control it is impossible to rule out that measured gains arise from reward shaping or instrumentation artifacts rather than genuine policy conditioning on call-stack features.

    Authors: The non-code-aware baselines already share the identical reward function and instrumentation; they differ solely by the absence of the call-stack input. To additionally control for observation dimensionality, we will add an explicit ablation in the revised §4 in which the call-stack channel is replaced by a dummy vector of identical dimension (zero-filled or random noise) while keeping all other components fixed. This will isolate the contribution of the actual call-stack features. revision: yes

  3. Referee: [§3.2] §3.2 (Environment Instrumentation): the paper asserts 'efficient call stack extraction' but supplies no runtime overhead measurements, extraction latency figures, or scaling behavior with call depth, leaving the practicality claim unsupported.

    Authors: We acknowledge that no quantitative overhead figures were supplied. During our experiments we recorded extraction latencies; in the revision we will add a short analysis (new paragraph or table in §3.2) reporting average per-step extraction time, overhead as a percentage of simulation time, and scaling across call depths up to the maximum observed in the test games. revision: yes
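
A minimal sketch of the paired comparison promised in response 1, assuming per-seed coverage numbers are available; the values are placeholders, not results from the paper.

```python
# Sketch of the paired comparison promised in response 1: paired t-test plus a
# 95% confidence interval on the per-seed coverage improvement of CA2 over the
# baseline. All numbers are placeholders, not results from the paper.
import numpy as np
from scipy import stats

ca2_coverage      = np.array([0.81, 0.78, 0.84, 0.80, 0.83])  # hypothetical per-seed coverage
baseline_coverage = np.array([0.70, 0.69, 0.74, 0.71, 0.72])

t_stat, p_value = stats.ttest_rel(ca2_coverage, baseline_coverage)  # paired t-test

diff = ca2_coverage - baseline_coverage
ci_low, ci_high = stats.t.interval(0.95, df=len(diff) - 1,
                                   loc=diff.mean(), scale=stats.sem(diff))

print(f"paired t = {t_stat:.2f}, p = {p_value:.4f}, "
      f"95% CI on coverage gain = [{ci_low:.3f}, {ci_high:.3f}]")
```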
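A minimal sketch of the dimensionality-matched ablation described in response 2, in which the call-stack channel is replaced by a zero-filled or noise vector of identical dimension while everything else stays fixed; the names and sizes are hypothetical.

```python
# Sketch of the dimensionality-matched ablation from response 2: the control
# agent sees a dummy vector in place of the call-stack encoding, so observation
# shape, reward, and instrumentation stay identical. Names and sizes are
# hypothetical.
import numpy as np

CALL_STACK_DIM = 4  # must match the dimension of the real call-stack encoding

def ablated_call_stack_channel(mode="zeros", rng=None):
    """Dummy replacement for the call-stack encoding in the control condition."""
    if mode == "zeros":
        return np.zeros(CALL_STACK_DIM, dtype=np.float32)
    if mode == "noise":
        if rng is None:
            rng = np.random.default_rng(0)
        return rng.standard_normal(CALL_STACK_DIM).astype(np.float32)
    raise ValueError(f"unknown ablation mode: {mode}")

def build_observation(game_state, call_stack_channel):
    """Identical concatenation for CA2 and for the ablated baseline."""
    return np.concatenate([np.asarray(game_state, dtype=np.float32),
                           call_stack_channel])

# CA2 would pass the real encoding here; the ablation swaps in the dummy channel.
obs_ablated = build_observation([0.2, -1.0, 3.5], ablated_call_stack_channel("noise"))
print(obs_ablated.shape)  # (7,)
```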
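A minimal sketch of the per-step overhead measurement promised in response 3, using Python's own stack inspection as a stand-in for the paper's engine-side profiler, which is not specified here.

```python
# Rough per-step timing of call-stack extraction, standing in for the overhead
# numbers promised in response 3. traceback.extract_stack is only an
# illustrative extraction mechanism; the paper's engine-side profiler differs.
import time
import traceback

def extract_call_stack():
    """Names of the functions currently on the (Python) call stack."""
    return [frame.name for frame in traceback.extract_stack()]

def extraction_overhead_us(num_steps=10_000):
    """Average extraction latency per step, in microseconds."""
    start = time.perf_counter()
    for _ in range(num_steps):
        extract_call_stack()
    elapsed = time.perf_counter() - start
    return 1e6 * elapsed / num_steps

print(f"avg extraction latency: {extraction_overhead_us():.1f} µs/step")
```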

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper presents an empirical RL agent (CA2) that augments game-state observations with call-stack traces and compares performance against non-code-aware baselines across instrumented environments. No equations, parameter-fitting steps, or derivation chains are described that would reduce any claimed result to its inputs by construction. The central claim rests on reported experimental improvements rather than any self-definitional, fitted-input, or self-citation reduction; the evaluation is therefore self-contained against external benchmarks and receives a score of 0.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are described in the abstract.

pith-pipeline@v0.9.0 · 5459 in / 954 out tokens · 31354 ms · 2026-05-15T05:49:50.226940+00:00 · methodology

discussion (0)

