JAMER: Project-Level Code Framework Dataset and Benchmark on Professional Game Engines

Chuanhao Li; Fanrui Zhang; Jianwen Sun; Kaipeng Zhang; Yifei Huang; Yu Dai; Yukang Feng; Zizhen Li

arxiv: 2606.19830 · v2 · pith:CP4GXYAFnew · submitted 2026-06-18 · 💻 cs.SE · cs.CL

Pith reviewed 2026-06-26 16:52 UTC · model grok-4.3

keywords: project-level code generation, game engine benchmarks, game jam datasets, runtime behavior evaluation, code agents, Godot engine, structural completeness, behavioral alignment

The pith

Game jam projects yield a benchmark showing AI models drop from 80% to under 6% runtime success as game code projects grow larger.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates JamSet and JamBench by filtering over 240,000 game jam repositories down to 8,133 verified projects and 300 manually checked ones using Godot's text format and headless execution. It defines theme-driven generation and code completion tasks scored by compilation rates, structural completeness, and behavioral alignment. Frontier models show a steep capability decline with project scale, and code agents raise compilation but leave runtime behavior unchanged. JamSet also serves as effective training data for the tasks.

Core claim

The central claim is that project-level code engineering on professional game engines can be benchmarked through game jam data, revealing that model performance collapses with scale and that the limiting factor is architectural design rather than syntax.

What carries the argument

The deterministic verification pipeline that checks file integrity, compiles projects, and collects runtime behavior on the Godot engine to produce verified game frameworks.
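The first stage of such a pipeline, the file integrity check, can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes that integrity means a `project.godot` file exists and that every `res://` path referenced from a text scene (`.tscn`) resolves to a file on disk. The function names `missing_resources`, `passes_file_integrity`, and `demo` are hypothetical.

```python
import re
import tempfile
from pathlib import Path

# Matches path="res://..." attributes as they appear in Godot text scenes (.tscn).
RES_PATH = re.compile(r'path="res://([^"]+)"')


def missing_resources(project_dir: Path) -> list[str]:
    """Return res:// paths referenced by .tscn files that do not exist on disk."""
    missing = set()
    for scene in project_dir.rglob("*.tscn"):
        text = scene.read_text(encoding="utf-8", errors="replace")
        for rel in RES_PATH.findall(text):
            if not (project_dir / rel).is_file():
                missing.add(rel)
    return sorted(missing)


def passes_file_integrity(project_dir: Path) -> bool:
    """Assumed first stage: project.godot exists and no scene reference dangles."""
    if not (project_dir / "project.godot").is_file():
        return False
    return not missing_resources(project_dir)


def demo() -> tuple[bool, list[str]]:
    """Build a toy project with one valid and one dangling scene reference."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "project.godot").write_text("config_version=5\n")
        (root / "player.gd").write_text("extends Node2D\n")
        (root / "main.tscn").write_text(
            "[gd_scene load_steps=3 format=3]\n"
            '[ext_resource type="Script" path="res://player.gd" id="1"]\n'
            '[ext_resource type="Texture2D" path="res://icon.png" id="2"]\n'
        )
        return passes_file_integrity(root), missing_resources(root)
```

Because the scene format is plain text, a check like this needs no engine at all, which is presumably part of why the text-based format matters for filtering hundreds of thousands of repositories cheaply before any compilation or execution step.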

If this is right

Runtime behavioral quality remains low even when compilation improves, so syntactic fixes alone do not solve project-level tasks.
Performance falls sharply from small to large projects, so scale must be treated as a distinct variable in code generation evaluation.
Training on the distilled JamSet data produces measurable gains on the benchmark tasks.
Architectural understanding, not just code correctness, forms the primary remaining obstacle.
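Treating scale as its own variable amounts to reporting pass rates per size bucket rather than one pooled number. The sketch below shows that aggregation on toy records; the bucket labels and the records are illustrative assumptions, not data from the paper, and `pass_rate_by_bucket` is a hypothetical helper.

```python
from collections import defaultdict


def pass_rate_by_bucket(results):
    """Map (bucket, passed) pairs to {bucket: runtime pass rate in percent}."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for bucket, passed in results:
        totals[bucket] += 1
        passes[bucket] += int(passed)
    return {bucket: 100.0 * passes[bucket] / totals[bucket] for bucket in totals}


# Toy records for illustration only: (size bucket, did the project run correctly?)
toy_results = [
    ("small", True), ("small", True), ("small", False), ("small", True),
    ("large", False), ("large", False), ("large", True), ("large", False),
]
rates = pass_rate_by_bucket(toy_results)
```

On these toy records the pooled pass rate is 50%, which hides the gap between 75% for small projects and 25% for large ones. That is the reason a per-bucket breakdown is needed to see a scale effect at all.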

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Benchmarks of this form could be extended to other engines by adapting the headless execution and behavior collection steps.
The observed design bottleneck implies that future agents may need explicit mechanisms for tracking inter-file dependencies and global game state.
If the capability cliff holds, incremental scaling of current models is unlikely to close the gap without changes in how projects are represented.

Load-bearing premise

Game jam projects under tight deadlines serve as suitable proxies for the challenges of professional game development without introducing selection bias through the verification steps.

What would settle it

Running the same models and agents on a separate collection of professional game projects not sourced from game jams and checking whether the scale-dependent drop and agent ineffectiveness on behavior persist.

Original abstract

Current AI-driven game development has made substantial progress in asset generation, gameplay design, and web-based game coding, yet project-level code engineering on professional game engines remains largely unexplored due to the absence of large-scale datasets and deterministic evaluation methods. We present JamSet and JamBench, the first project-level game code framework dataset and benchmark built on a professional game engine. Our key insight is that Game Jam competitions, community events where developers build complete games under tight time constraints, yield thousands of open-source projects suitable for this purpose. Building on the Godot engine's text-based format and headless execution mode, we design a deterministic verification pipeline from file integrity to runtime behavior collection, distilling 8,133 verified projects from over 240,000 repositories. Of these, 300 manually verified projects form JamBench; the rest constitute JamSet. JamBench defines theme-driven generation and code completion tasks, evaluated through a pipeline combining compilation pass rates, Structural Completeness Score (SCS), and Behavioral Alignment Score (BAS). Evaluation of 9 frontier models reveals a capability cliff as project scale increases, with runtime pass rates dropping from 80.4% on small projects to 5.7% on large ones (Task2a). Code Agents improve compilation rates yet yield no gains in runtime behavioral quality, indicating that the bottleneck lies in architectural design rather than syntactic correctness. Experiments validate JamSet as effective training data. All data and code are publicly available.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper delivers a new Godot project dataset from game jams and shows models degrade on larger projects, but the behavioral metrics may not fully track scale-dependent runtime issues.

The main takeaway is that this work supplies the first large collection of complete, runnable Godot projects pulled from game jam repositories, paired with a benchmark that documents a steep performance drop as project size grows.

They filtered over 240,000 repos down to 8,133 verified projects via a pipeline that checks file integrity, compiles, and collects runtime behavior in headless mode. JamBench holds 300 manually reviewed ones and sets up theme-driven generation plus code completion tasks scored on compilation, structural completeness, and behavioral alignment. The nine-model results show runtime pass rates falling from roughly 80% on small projects to under 6% on large ones, with code agents lifting compilation but not behavioral quality.
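A headless run of a single project could look like the sketch below. It is an assumption about how such a stage might be driven, not the paper's harness: `--headless`, `--path`, and `--quit` are Godot 4 command-line options, while `godot_bin`, `headless_command`, `run_headless`, and the timeout value are hypothetical choices made for illustration.

```python
import subprocess
from pathlib import Path


def headless_command(godot_bin: str, project_dir: Path) -> list[str]:
    """Build the command line for one headless run of a Godot 4 project."""
    # --headless disables the display and audio drivers, --path selects the
    # project directory, and --quit exits after the first iteration.
    return [godot_bin, "--headless", "--path", str(project_dir), "--quit"]


def run_headless(godot_bin: str, project_dir: Path, timeout_s: float = 60.0):
    """Run the project once and return (exit_code, combined output).

    Returns (None, "timeout") if the engine does not exit in time.
    Raises FileNotFoundError if godot_bin is not installed.
    """
    try:
        proc = subprocess.run(
            headless_command(godot_bin, project_dir),
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return None, "timeout"
    return proc.returncode, proc.stdout + proc.stderr
```

A runtime behavior collector would need the project to run for more than one iteration, so a real harness would replace `--quit` with its own stopping rule. The source does not say how the paper does this.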

The dataset release and the deterministic pipeline are concrete contributions that fill a gap left by function-level or web-game benchmarks. The public code and data make the setup checkable, and the agent comparison gives a clear signal that syntax is not the main limit.

The soft spot is the behavioral alignment score. If it under-samples stateful, interactive, or timing-dependent behaviors that appear more often in bigger projects, the reported capability cliff could partly be an artifact of the measurement rather than a pure model limit. Game jam projects are also quick, small-team builds, so they may not stand in cleanly for professional development tasks.

This is useful for groups building project-scale code generation tools or game-specific benchmarks. The data alone is worth referee attention even if the metric definitions need tightening.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces JamSet, a dataset of 8,133 verified Godot game projects distilled from over 240,000 game-jam repositories via a deterministic pipeline, and JamBench, a 300-project manually verified benchmark subset. It defines theme-driven generation and code-completion tasks evaluated by compilation pass rates, Structural Completeness Score (SCS), and Behavioral Alignment Score (BAS). Evaluation of nine frontier models reports runtime pass rates falling from 80.4% on small projects to 5.7% on large ones (Task 2a), with code agents raising compilation rates but producing no improvement in runtime behavioral quality, leading to the conclusion that the bottleneck is architectural design rather than syntax.
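For a sense of scale, the two reported rates can be turned into an absolute drop, a relative drop, and a ratio. The snippet below only does arithmetic on the 80.4% and 5.7% figures quoted above; the helper name `cliff_summary` is hypothetical.

```python
def cliff_summary(small_rate: float, large_rate: float) -> dict[str, float]:
    """Summarize the gap between two pass rates given in percent."""
    absolute_drop = small_rate - large_rate          # percentage points
    relative_drop = 100.0 * absolute_drop / small_rate  # share of the small-project rate lost
    ratio = small_rate / large_rate                  # how many times higher the small rate is
    return {
        "absolute_drop_points": round(absolute_drop, 1),
        "relative_drop_percent": round(relative_drop, 1),
        "ratio": round(ratio, 1),
    }


summary = cliff_summary(80.4, 5.7)
```

The result is a drop of 74.7 percentage points, which is about 92.9% of the small-project pass rate, or roughly a 14-fold difference between the two buckets.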

Significance. If the verification pipeline is shown to measure functional equivalence without scale-dependent bias, the work supplies the first large-scale, publicly released project-level benchmark and training corpus for professional game-engine code, documenting a clear capability cliff and isolating architectural reasoning as the limiting factor. The release of data, code, and the use of an external, independently defined source (game jams) are concrete strengths for reproducibility.

major comments (2)

[Abstract] Abstract: The headline claims of a capability cliff (80.4% → 5.7% runtime pass rate) and an architectural-design bottleneck rest on BAS correctly capturing behavioral equivalence. The abstract states only that the pipeline proceeds 'from file integrity to runtime behavior collection' using Godot headless mode, without specifying how BAS scores stateful, interactive, or timing-dependent behaviors that become more prevalent at larger scales; if these behaviors are under-sampled, the observed drop could be an artifact of the metric rather than model capability.
[Abstract] Abstract / verification pipeline: Project inclusion thresholds are listed as free parameters, yet no concrete values, sensitivity analysis, or validation against manual inspection at different scales are provided. This leaves open the possibility of selection bias that systematically affects larger projects and thereby undermines the cross-scale comparison central to the main result.

minor comments (2)

[Abstract] The abstract refers to 'nine frontier models' and 'Task2a' without enumerating the models or defining the task variants; these should be stated explicitly in the evaluation section.
[Abstract] The claim that 'Experiments validate JamSet as effective training data' is asserted without accompanying metrics, baselines, or section reference; the relevant results should be cited.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thorough review and constructive feedback on the verification pipeline and metric definitions. We address the major comments point by point below, providing clarifications from the full manuscript and committing to revisions where appropriate to strengthen the presentation.

Referee: [Abstract] Abstract: The headline claims of a capability cliff (80.4% → 5.7% runtime pass rate) and an architectural-design bottleneck rest on BAS correctly capturing behavioral equivalence. The abstract states only that the pipeline proceeds 'from file integrity to runtime behavior collection' using Godot headless mode, without specifying how BAS scores stateful, interactive, or timing-dependent behaviors that become more prevalent at larger scales; if these behaviors are under-sampled, the observed drop could be an artifact of the metric rather than model capability.

Authors: The full manuscript provides additional details on BAS in Section 3.3, where behavioral alignment is assessed through deterministic execution in headless mode, capturing state vectors and event logs at regular intervals, with comparison via normalized edit distance on state sequences and success on predefined test scenarios derived from the original projects. For interactive behaviors, we utilize replay buffers of user inputs where present in the jam projects, and timing is handled by fixed frame rates. We agree that the abstract is concise and does not fully convey these mechanisms, which could lead to the concern raised. We will revise the abstract to include a brief description of BAS computation and add a paragraph in the methods on handling complex behaviors to mitigate concerns about under-sampling at scale. revision: yes
Referee: [Abstract] Abstract / verification pipeline: Project inclusion thresholds are listed as free parameters, yet no concrete values, sensitivity analysis, or validation against manual inspection at different scales are provided. This leaves open the possibility of selection bias that systematically affects larger projects and thereby undermines the cross-scale comparison central to the main result.

Authors: We acknowledge this as a valid observation regarding the presentation. The manuscript (Section 2.1) defines the thresholds as parameters but the specific values used in JamSet construction (e.g., minimum 3 source files, project size < 100MB, and pass rate thresholds) are provided in the supplementary materials and code release rather than the main text. No sensitivity analysis across scales was performed in the original work. To address potential selection bias, we will include the concrete parameter values in the main text, conduct a sensitivity analysis on a subsample, and report manual verification rates stratified by project size in the revision. revision: yes
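The comparison step named in the first response, a normalized edit distance over state sequences, can be illustrated with a short sketch. It is not the paper's BAS definition, which this page does not give: normalizing by the longer sequence and mapping the result to a score in [0, 1] are assumptions, as are the names `edit_distance` and `alignment_score` and the toy state tuples.

```python
from typing import Hashable, Sequence


def edit_distance(a: Sequence[Hashable], b: Sequence[Hashable]) -> int:
    """Levenshtein distance between two sequences of sampled states."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def alignment_score(reference: Sequence[Hashable], generated: Sequence[Hashable]) -> float:
    """1.0 for identical sequences, 0.0 when nothing lines up (assumed normalization)."""
    longest = max(len(reference), len(generated))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(reference, generated) / longest


# Toy state samples of the form (player_state, score), taken at fixed intervals.
reference = [("idle", 0), ("move", 0), ("move", 1), ("dead", 1)]
generated = [("idle", 0), ("move", 0), ("dead", 0)]
score = alignment_score(reference, generated)
```

Here the generated run misses the pickup that raises the score, so two of four reference states fail to align and the score is 0.5. The referee's concern maps directly onto this kind of metric: if the sampled state vector omits the stateful or timing-dependent variables that large projects rely on, two runs can look aligned, or misaligned, for reasons unrelated to gameplay.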

Circularity Check

0 steps flagged

No circularity: dataset and benchmark results are direct measurements from external sources

The paper constructs JamSet (8,133 projects) and JamBench (300 projects) by filtering public game-jam repositories through a deterministic pipeline (file integrity → compilation → SCS → BAS) whose definition and execution do not depend on any model outputs or fitted parameters. The reported capability cliff (80.4% → 5.7% runtime pass rate) and the conclusion that code agents improve compilation but not behavioral quality are direct empirical measurements on nine frontier models; no equation, ansatz, or self-citation reduces these results to quantities fitted from the same models. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central contribution rests on the domain assumption that game jam projects can serve as proxies for professional game development and that the automated verification pipeline produces reliable functional labels without significant bias.

free parameters (1)

Project inclusion thresholds
Specific criteria used to distill 8,133 verified projects from over 240,000 repositories are not detailed in the abstract.
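An inclusion filter of this kind can be sketched with the two example values quoted in the simulated rebuttal above (at least 3 source files, project size under 100MB). Everything else here is an assumption for illustration: the class and function names are hypothetical, 100MB is read as 100 × 1024 × 1024 bytes, and the unspecified pass-rate thresholds are left out rather than guessed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InclusionThresholds:
    min_source_files: int = 3                # example value from the simulated rebuttal
    max_size_bytes: int = 100 * 1024 * 1024  # "< 100MB", interpreted as MiB (assumption)


@dataclass(frozen=True)
class ProjectStats:
    name: str
    source_files: int
    size_bytes: int


def is_included(stats: ProjectStats,
                thresholds: InclusionThresholds = InclusionThresholds()) -> bool:
    """Keep a project only if it has enough source files and stays under the size cap."""
    return (stats.source_files >= thresholds.min_source_files
            and stats.size_bytes < thresholds.max_size_bytes)


MIB = 1024 * 1024
# Toy projects for illustration only.
candidates = [
    ProjectStats("tiny-prototype", source_files=2, size_bytes=1 * MIB),
    ProjectStats("jam-platformer", source_files=12, size_bytes=40 * MIB),
    ProjectStats("asset-heavy", source_files=50, size_bytes=300 * MIB),
]
kept = [p.name for p in candidates if is_included(p)]
```

Keeping the thresholds in one object makes the sensitivity analysis the referee asks for straightforward: rerun the filter with different `InclusionThresholds` values and compare how many projects survive in each size bucket.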

axioms (1)

Domain assumption: game jam projects are representative of professional game development tasks
The paper sources the entire dataset from game jam repositories and treats them as suitable for project-level code engineering benchmarks.

[1] [1]

Austin, A

J. Austin, A. Odena, M. Nye, M. Bosma, H. Michalewski, D. Dohan, E. Jiang, C. Cai, M. Terry, Q. Le, and C. Sutton. Program syn- thesis with large language models, 2021

2021

[2] [2]

H. Che, X. He, Q. Liu, C. Jin, and H. Chen. Gamegen-x: Interactive open-world game video generation, 2024

2024

[3] [3]

M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. de Oliveira Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman, A. Ray, R. Puri, G. Krueger, M. Petrov, H. Khlaaf, G. Sastry, P. Mishkin, B. Chan, S. Gray, N. Ry- der, M. Pavlov, A. Power, L. Kaiser, M. Bavar- ian, C. Winter, P. Tillet, F. P. Such, D. Cum- mings, M. Plappert, F. Chantzis, E. Barnes, A...

2021

[4] [4]

Chen and A

Y.-C. Chen and A. Jhala. Gametilenet: A se- mantic dataset for low-resolution game art in procedural content generation, 2025

2025

[5] [5]

W. Chi, Y. Fang, A. Yayavaram, S. Yayavaram, S. Karten, Q. A. Wei, R. Chen, A. Wang, V. Chen, A. Talwalkar, and C. Donahue. Gamedevbench: Evaluating agentic capabilities through game development, 2026. 9

2026

[6] [6]

Coppola, T

R. Coppola, T. Fulcini, S. Manzi, and F. Strada. How to measure game testing: a survey of cover- age metrics. InProceedings of the ACM/IEEE 8th International Workshop on Games and Soft- ware Engineering, GAS ’24, page 15–19, New York, NY, USA, 2024. Association for Comput- ing Machinery

2024

[7] [7]

Coutinho and L

F. Coutinho and L. Chaimowicz. On the chal- lenges of generating pixel art character sprites using gans.Proceedings of the AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment, 18(1):87–94, Oct. 2022

2022

[8] [8]

Y. Ding, Z. Wang, W. Ahmad, H. Ding, M. Tan, N. Jain, M. K. Ramanathan, R. Nallapati, P. Bhatia, D. Roth, and B. Xiang. Crosscodee- val: A diverse and multilingual benchmark for cross-file code completion. In A. Oh, T. Nau- mann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors,Advances in Neural Infor- mation Processing Systems, volume 36, pages...

2023

[9] [9]

X. Du, M. Liu, K. Wang, H. Wang, J. Liu, Y. Chen, J. Feng, C. Sha, X. Peng, and Y. Lou. Classeval: A manually-crafted benchmark for evaluating llms on class-level code generation, 2023

2023

[10] [10]

Earle, S

S. Earle, S. Parajuli, and A. Banburski-Fahey. Dreamgarden: A designer assistant for growing games from a single prompt. InProceedings of the 2025 CHI Conference on Human Factors in Computing Systems, CHI ’25, New York, NY, USA, 2025. Association for Computing Machin- ery

2025

[11] [11]

Farrokhi Maleki and R

M. Farrokhi Maleki and R. Zhao. Procedural content generation in games: A survey with insights on emerging llm integration.Proceed- ings of the AAAI Conference on Artificial In- telligence and Interactive Digital Entertainment, 20(1):167–178, Nov. 2024

2024

[12] [12]

S. Hong, M. Zhuge, J. Chen, X. Zheng, Y. Cheng, J. Wang, C. Zhang, z. wang, S. Yau, Z. Lin, L. Zhou, C. Ran, L. Xiao, C. Wu, and J. Schmid- huber. Metagpt: Meta programming for a multi-agent collaborative framework. In B. Kim, Y. Yue, S. Chaudhuri, K. Fragkiadaki, M. Khan, and Y. Sun, editors,International Conference on Learning Representations, volume 20...

2024

[13] [13]

Hsieh, J

C.-A. Hsieh, J. Zhang, and A. Yan. Sprite sheet diffusion: Generate game character for anima- tion, 2025

2025

[14] [14]

S. Hu, T. Huang, G. Liu, R. R. Kompella, F. Il- han, S. F. Tekin, Y. Xu, Z. Yahn, and L. Liu. A survey on large language model-based game agents, 2025

2025

[15] [15]

AgentCoder: Multi-Agent-based Code Generation with Iterative Testing and Optimisation

D. Huang, Q. Bu, J. M. Zhang, M. Luck, and H. Cui. Agentcoder: Multi-agent-based code generation with iterative testing and optimisa- tion.ArXiv, abs/2312.13010, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[16] [16]

N. Jain, M. Shetty, T. Zhang, K. Han, K. Sen, and I. Stoica. R2e: Turning any github reposi- tory into a programming agent environment. In ICML, 2024

2024

[17] [17]

Jiang, J

Y. Jiang, J. Hu, Q. Xiao, Y. Zheng, R. Ma, K. Feng, J. Han, T. Peng, K. Fan, M. Zhang, and X. Yue. Opengame: Open agentic coding for games, 2026

2026

[18] [18]

Press, and K

C.E.Jimenez, J.Yang, A.Wettig, S.Yao, K.Pei, O. Press, and K. Narasimhan. Swe-bench: Can language models resolve real-world github issues? InB.Kim, Y.Yue, S.Chaudhuri, K.Fragkiadaki, M. Khan, and Y. Sun, editors,International Conference on Learning Representations, volume 2024, pages 54107–54157, 2024

2024

[19]

A. Kultima. Game jam natives?: The rise of the game jam era in game development cultures. In Proceedings of the 6th International Conference on Game Jams, Hackathons and Game Creation Events, ICGJ 2021, ACM International Conference Proceeding Series, pages 22–28, United States, Aug. 2021. ACM. Publisher Copyright: ©2021 ACM.; International Conference on...

[20]

A. Kumarappan, P. A. Golnari, W. Wen, X. Liu, G. Ryan, Y. Sun, S. Fu, and E. Nallipogu. Devbench: A realistic, developer-informed benchmark for code generation models, 2026

[21]

G. Lai, A. Kultima, F. Khosmood, J. Pirker, A. Fowler, I. Vecchi, W. Latham, and F. Fol Leymarie. Two decades of game jams. In Proceedings of the 6th Annual International Conference on Game Jams, Hackathons, and Game Creation Events, ICGJ ’21, page 1–11, New York, NY, USA, 2021. Association for Computing Machinery

[22]

J. Li, H. Deng, Y. Zhang, K. Zhang, T. Shao, T. Zhao, W. Wang, Z. Jin, G. Li, Y. Liu, Y. Fang, and Y. Dong. Realbench: A repo-level code generation benchmark aligned with real-world software development practices, 2026

[23]

J. Li, G. Li, X. Zhang, Y. Dong, and Z. Jin. Evocodebench: An evolving code generation benchmark aligned with real-world code repositories, 2024

[24] [25]

J. Li, G. Li, Y. Zhao, Y. Li, H. Liu, H. Zhu, L. Wang, K. Liu, Z. Fang, L. Wang, J. Ding, X. Zhang, Y. Zhu, Y. Dong, Z. Jin, B. Li, F. Huang, Y. Li, B. Gu, and M. Yang. DevEval: A manually-annotated code generation benchmark aligned with real-world code repositories. In L.-W. Ku, A. Martins, and V. Srikumar, editors, Findings of the Association for Com...

2024

[25] [26]

K. Liu, Y. Pan, Y. Xiang, D. He, J. Li, Y. Du, and T. Gao. ProjectEval: A benchmark for programming agents automated evaluation on project-level code generation. In W. Che, J. Nabende, E. Shutova, and M. T. Pilehvar, editors, Findings of the Association for Computational Linguistics: ACL 2025, pages 20205–20221, Vienna, Austria, July 2025. Association f...

[26] [27]

T. Liu, C. Xu, and J. McAuley. Repobench: Benchmarking repository-level code auto-completion systems. In B. Kim, Y. Yue, S. Chaudhuri, K. Fragkiadaki, M. Khan, and Y. Sun, editors, International Conference on Learning Representations, volume 2024, pages 47832–47850, 2024

[27] [28]

O. Madar and O. Fried. Tiled diffusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2025

[28] [29]

S. Ouyang, D. Huang, J. Guo, Z. Sun, Q. Zhu, and J. M. Zhang. Dscodebench: A realistic benchmark for data science code generation. Proceedings of the AAAI Conference on Artificial Intelligence, 40(38):32628–32636, Mar. 2026

[29] [30]

W. Peng, X. Wang, and Q. Wu. Proxywar: Dynamic assessment of llm code generation in game arenas, 2026


[30] [31]

M. J. Scott, G. Ghinea, and I. Hamilton. Promoting inclusive design practice at the global game jam: A pilot evaluation. In 2014 IEEE Frontiers in Education Conference (FIE) Proceedings, page 1–4. IEEE, Oct. 2014

[31] [32]

L. Soni and A. Kaur. Merits and demerits of unreal and unity: A comprehensive comparison. In 2024 International Conference on Computational Intelligence for Green and Sustainable Technologies (ICCIGST), pages 1–5, 2024

[32] [33]

S. Sudhakaran, M. González-Duque, M. Freiberger, C. Glanois, E. Najarro, and S. Risi. Mariogpt: Open-ended text2level generation through large language models. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors, Advances in Neural Information Processing Systems, volume 36, pages 54213–54227. Curran Associates, Inc., 2023

[33] [34]

A. Summerville, S. Snodgrass, M. Guzdial, C. Holmgård, A. K. Hoover, A. Isaksen, A. Nealen, and J. Togelius. Procedural content generation via machine learning (pcgml), 2018

[34] [35]

W. Tan, W. Zhang, X. Xu, H. Xia, Z. Ding, B. Li, B. Zhou, J. Yue, J. Jiang, Y. Li, R. An, M. Qin, C. Zong, L. Zheng, Y. Wu, X. Chai, Y. Bi, T. Xie, P. Gu, X. Li, C. Zhang, L. Tian, C. Wang, X. Wang, B. F. Karlsson, B. An, S. Yan, and Z. Lu. Cradle: Empowering foundation agents towards general computer control. arXiv preprint arXiv:2403.03186, 2024

[35] [36]

S. Tang, K. Zhao, L. Wang, Y. Li, X. Liu, J. Zou, Q. Wang, and X. Chu. UnrealLLM: Towards highly controllable and interactable 3D scene generation by LLM-powered procedural content generation. In W. Che, J. Nabende, E. Shutova, and M. T. Pilehvar, editors, Findings of the Association for Computational Linguistics: ACL 2025, pages 19417–19435, Vienna, Austria, July 2025. Association for Computational Linguistics

[37] [38]

K. Vergopoulos, M. N. Müller, and M. Vechev. Automated benchmark generation for repository-level coding tasks, 2025

[38] [39]

G. Wang, Y. Xie, Y. Jiang, A. Mandlekar, C. Xiao, Y. Zhu, L. Fan, and A. Anandkumar. Voyager: An open-ended embodied agent with large language models, 2023

[39] [40]

S. Wu, Y. Huang, C. Gao, D. Chen, Q. Zhang, Y. Wan, T. Zhou, X. Zhang, J. Gao, C. Xiao, et al. Unigen: A unified framework for textual dataset generation using large language models. arXiv preprint arXiv:2406.18966, 2024


[40] [41]

C. S. Xia, Y. Deng, S. Dunn, and L. Zhang. Agentless: Demystifying llm-based software engineering agents, 2024

[41] [42]

Z. Xu, C. Yu, F. Fang, Y. Wang, and Y. Wu. Language agents with reinforcement learning for strategic play in the werewolf game. In Proceedings of the 41st International Conference on Machine Learning, ICML’24. JMLR.org, 2024

[42] [43]

J. Yang, C. E. Jimenez, A. Zhang, K. Lieret, J. Yang, X. Wu, O. Press, N. Muennighoff, G. Synnaeve, K. Narasimhan, D. Yang, S. Wang, and O. Press. Swe-bench multimodal: Do ai systems generalize to visual software domains? In Y. Yue, A. Garg, N. Peng, F. Sha, and R. Yu, editors, International Conference on Learning Representations, volume 2025, pages 2794–2...

2025

[43] [44]

L. Yin, W. Cheng, Z. Qin, T. Huang, Y. Li, and G. Ding. Autoue: Automated generation of 3d games in unreal engine via multi-agent systems, 2026


[44] [45]

K. Zhang, J. Li, G. Li, X. Shi, and Z. Jin. CodeAgent: Enhancing code generation with tool-integrated agent systems for real-world repo-level coding challenges. In L.-W. Ku, A. Martins, and V. Srikumar, editors, Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 13643–13658, Bangkok, ...

2024

[45] [46]

W. Zhang, J. Yang, R. Tao, L. Chai, S. Guo, J. Wu, X. Chen, G. Cui, N. Ding, X. Xu, H. Wei, and B. Zhou. V-gamegym: Visual game generation for code large language models, 2025

[46] [47]

T. Y. Zhuo, M. C. Vu, J. Chim, H. Hu, W. Yu, R. Widyasari, I. N. B. Yusuf, H. Zhan, J. He, I. Paul, S. Brunner, C. Gong, T. Hoang, A. R. Zebaze, X. Hong, W.-D. Li, J. Kaddour, M. Xu, Z. Zhang, P. Yadav, N. Jain, A. Gu, Z. Cheng, J. Liu, Q. Liu, Z. Wang, B. Hui, N. Muennighoff, D. Lo, D. Fried, X. Du, H. de Vries, and L. V. Werra. Bigcodebench: Benchmarkin...

2025


1. The player node (the main controllable character/object)
2. Score/points tracking (if any)
3. Health/lives tracking (if any)
4. Key gameplay signals (collisions, pickups, damage, etc.)
5. Win/lose conditions (if identifiable)
6. What behaviors to expect when input is injected

IMPORTANT RULES:
- Only identify nodes/properties you can CONFIDENTLY determine from the code structure
- Use actual node paths from the scene files, not guesses
- If you cannot determine something, set it to null
- All node paths should be relative to the scene root (e.g., "Player", "World/Player", "UI/S...