Mage: Multi-Axis Evaluation of LLM-Generated Executable Game Scenes Beyond Compile-Pass Rate
Pith reviewed 2026-05-11 00:51 UTC · model grok-4.3
The pith
Compile-pass rate is anti-correlated with functional correctness for LLM-generated game scenes; multi-axis evaluation is needed to detect the divergence.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Compile-pass rate is the dominant evaluation signal for LLM code generation, yet for multi-component domain-specific artifacts it can be actively misleading. We demonstrate this on executable game scene synthesis with a four-axis evaluation protocol -- compile success, runtime success, structural fidelity, and mechanism adherence -- applied to 858 generation attempts across four open-weight LLMs, 26 hand-crafted Unity goal pattern playable concepts, and two automatically extracted IR granularity levels. Direct NL-to-C# generation achieves the highest runtime-pass rate yet produces structurally vacuous scenes. Structural IR conditioning halves the runtime rate but recovers domain-faithful structure (F1 up to 1.00).
What carries the argument
The Mage four-axis evaluation protocol of compile success, runtime success, structural fidelity via F1 score, and mechanism adherence via F1 score, used to compare direct NL-to-C# generation against IR-conditioned generation at behavior-only and full-scene granularities.
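Both fidelity axes are reported as F1 scores over scene elements matched against a gold pattern. The paper's exact extraction and matching rules are not reproduced here, so the following is a minimal sketch assuming exact name matching over element sets (the element names are hypothetical):

```python
def set_f1(predicted: set[str], gold: set[str]) -> float:
    """F1 of predicted scene elements against a gold pattern: 2PR / (P + R)."""
    tp = len(predicted & gold)          # elements present in both sets
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)

# Hypothetical mechanism sets for one goal pattern
gold = {"Collectable", "ScoreTrigger", "WinCondition", "Timer"}
pred = {"Collectable", "ScoreTrigger", "PlayerController"}
score = set_f1(pred, gold)  # precision 2/3, recall 2/4 -> F1 = 4/7
```

A scene can compile and run while matching almost none of the gold elements, which is exactly the divergence the protocol is built to surface.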
If this is right
- Direct natural language prompts lead to higher runtime success but near-zero mechanism adherence in generated game scenes.
- Intermediate representation conditioning improves structural fidelity to near-perfect levels but reduces runtime success rates.
- Different levels of IR granularity produce statistically equivalent results on runtime and fidelity metrics.
- Multi-axis evaluation is required to accurately assess LLM performance on complex executable domain artifacts beyond simple compilation checks.
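Taken together, the four axes amount to a per-attempt record aggregated per condition. A minimal sketch of that bookkeeping (field and function names are my own, not the paper's):

```python
from dataclasses import dataclass

@dataclass
class Attempt:
    compiled: bool        # axis 1: compile success
    ran: bool             # axis 2: runtime success
    structural_f1: float  # axis 3: structural fidelity
    mechanism_f1: float   # axis 4: mechanism adherence

def aggregate(attempts: list[Attempt]) -> dict[str, float]:
    """Per-condition pass rates and mean F1s. A gap between runtime_rate
    and the F1 means is what a compile-only metric cannot see."""
    n = len(attempts)
    return {
        "compile_rate": sum(a.compiled for a in attempts) / n,
        "runtime_rate": sum(a.ran for a in attempts) / n,
        "structural_f1": sum(a.structural_f1 for a in attempts) / n,
        "mechanism_f1": sum(a.mechanism_f1 for a in attempts) / n,
    }
```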
Where Pith is reading between the lines
- Current benchmarks for code generation that rely primarily on compilation may systematically overestimate the quality of outputs in specialized domains like game development.
- Integrating structural and functional metrics into LLM evaluation pipelines could lead to models that produce more usable game scenes.
- The observed trade-off between runtime and structure suggests opportunities for hybrid generation methods that combine natural language flexibility with structured guidance.
Load-bearing premise
The 26 hand-crafted Unity goal patterns and the F1-based structural fidelity and mechanism adherence metrics serve as valid proxies for domain-faithful structure and functional correctness.
What would settle it
A large-scale human evaluation in which experts rate the playability and correctness of the generated scenes: if the high-runtime, low-F1 scenes were judged as functional as (or more functional than) the high-F1 scenes, the claimed need for multi-axis evaluation would be falsified.
Original abstract
Compile-pass rate is the dominant evaluation signal for LLM code generation, yet for multi-component domain-specific artifacts it can be actively misleading. We demonstrate this on executable game scene synthesis with a four-axis evaluation protocol (named "Mage") -- compile success, runtime success, structural fidelity, and mechanism adherence -- applied to 858 generation attempts across four open-weight LLMs (7B-30B), 26 hand-crafted Unity goal pattern playable concepts, and two automatically extracted IR granularity levels. Direct NL-to-C# generation achieves the highest runtime-pass rate (43% mean) yet produces structurally vacuous scenes (mechanism F1 ≈ 0.12). Structural IR conditioning halves the runtime rate but recovers domain-faithful structure (F1 up to 1.00). Within IR conditioning, behavior-only and full-scene granularity are statistically indistinguishable (McNemar p = 1.0), indicating input-level granularity saturation. These results show that compile rate is anti-correlated with functional correctness in this domain and that multi-axis evaluation is necessary to detect the divergence. We release the benchmark, replay logs, and per-record metrics for independent verification.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces the Mage four-axis evaluation protocol (compile success, runtime success, structural fidelity via F1, mechanism adherence via F1) for LLM-generated executable Unity game scenes. On 858 attempts across four open-weight LLMs (7B-30B), 26 hand-crafted goal patterns, and two IR granularity levels, direct NL-to-C# generation achieves the highest runtime rate (43%) but low mechanism F1 (~0.12), while IR conditioning halves runtime yet recovers F1 up to 1.00; behavior-only and full-scene IR are statistically indistinguishable (McNemar p=1.0). The authors conclude that compile-pass rate is anti-correlated with functional correctness in this domain and that multi-axis evaluation is necessary, releasing the benchmark, logs, and metrics.
Significance. If the results hold, the work is significant for highlighting how single-metric syntactic evaluation can mislead in complex, multi-component code generation domains such as games. The empirical demonstration of divergence between runtime success and structural/mechanism fidelity, combined with the public release of the benchmark, replay logs, and per-record metrics, provides a concrete, verifiable foundation for improving evaluation practices. This could shift community standards toward richer protocols.
major comments (1)
- [Abstract and evaluation protocol description] The central claim that runtime success is anti-correlated with functional correctness (Abstract) rests on the assumption that the automatically computed structural F1 and mechanism F1 scores are faithful proxies for 'domain-faithful structure' and 'functional correctness'. These rely exclusively on the 26 hand-crafted Unity patterns as ground truth with no reported human validation, inter-annotator agreement, or correlation to actual execution traces or playability outcomes; if the F1 extraction systematically misses or over-counts functional defects that still compile/run, both the anti-correlation finding and the necessity of multi-axis evaluation are undermined.
minor comments (2)
- Provide additional detail on the precise IR extraction rules, element matching for F1 computation, and any data exclusion criteria applied to the 858 attempts.
- The manuscript could include a brief discussion of potential biases in the hand-crafted goal patterns and how they relate to broader game scene diversity.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address the major comment below and will incorporate revisions to strengthen the validation of our evaluation protocol.
Point-by-point responses
-
Referee: [Abstract and evaluation protocol description] The central claim that runtime success is anti-correlated with functional correctness (Abstract) rests on the assumption that the automatically computed structural F1 and mechanism F1 scores are faithful proxies for 'domain-faithful structure' and 'functional correctness'. These rely exclusively on the 26 hand-crafted Unity patterns as ground truth with no reported human validation, inter-annotator agreement, or correlation to actual execution traces or playability outcomes; if the F1 extraction systematically misses or over-counts functional defects that still compile/run, both the anti-correlation finding and the necessity of multi-axis evaluation are undermined.
Authors: We acknowledge this is a valid concern and that the manuscript does not report human validation, IAA, or explicit correlation studies for the F1 proxies. The 26 patterns were hand-crafted by the authors as representative Unity game mechanics, with F1 computed via automated parsing of generated scenes against these patterns. In revision we will add: (1) a detailed appendix describing pattern design and full pattern list; (2) a small-scale manual analysis correlating F1 scores with human judgments of mechanism presence and playability on 50 sampled generations (including runtime-success cases); (3) discussion of extraction limitations. These additions will support the proxy claim and the anti-correlation observation without changing the reported results. The released logs and benchmark enable further external verification. revision: yes
Circularity Check
No circularity: purely empirical measurements with no derivations or self-referential reductions
full rationale
The paper reports direct experimental results from 858 LLM generation attempts evaluated on four axes (compile success, runtime success, structural F1, mechanism F1) across NL-to-C# and IR-conditioned inputs. No equations, fitted parameters, predictions derived from inputs, or self-citation chains appear in the derivation of the central claims. The observed anti-correlation between runtime-pass rate and F1 scores, plus the statistical tests (McNemar p=1.0), are computed from the released artifacts and hand-crafted patterns without reducing any quantity to a self-defined or fitted input. The metrics are explicitly defined and applied as evaluation proxies; their validity is an external question, not a circularity issue within the reported chain.
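The granularity comparison rests on McNemar's test over paired pass/fail outcomes, and a p-value of exactly 1.0 falls out of the exact binomial form whenever the two discordant counts are equal. A self-contained sketch of that computation (the discordant counts below are illustrative, not the paper's):

```python
from math import comb

def mcnemar_exact(b: int, c: int) -> float:
    """Exact two-sided McNemar p-value from discordant-pair counts.
    b: pairs where condition A passed and condition B failed
    c: pairs where condition B passed and condition A failed
    Concordant pairs do not enter the exact test."""
    n = b + c
    if n == 0:
        return 1.0
    # Two-sided binomial tail at p = 0.5, capped at 1.0
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) * 0.5 ** n
    return min(2.0 * tail, 1.0)

# Equal discordant counts (e.g. 3 scenes each way) give p = 1.0:
# no detectable runtime-success difference between the two granularities.
p = mcnemar_exact(3, 3)  # -> 1.0
```

This matches the exact (binomial) variant of the test; large-sample chi-square approximations would give slightly different values but the same qualitative conclusion at b = c.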
Axiom & Free-Parameter Ledger
axioms (1)
- [standard math] The McNemar test assumptions hold for the paired runtime-success comparisons between granularity levels.
invented entities (1)
- Mage four-axis evaluation protocol (no independent evidence)
Reference graph
Works this paper leans on
- [1] Chen, Mark et al. Evaluating Large Language Models Trained on Code. arXiv:2107.03374, 2021.
- [2] Jimenez, Carlos E.; Yang, John; Wettig, Alexander; Yao, Shunyu; Pei, Kexin; Press, Ofir; Narasimhan, Karthik. SWE-bench: Can Language Models Resolve Real-World GitHub Issues? ICLR 2024.
- [3] Hendrycks, Dan; Basart, Steven; Kadavath, Saurav; Mazeika, Mantas; Arora, Akul; Guo, Ethan; Burns, Collin; Puranik, Samir; He, Horace; Song, Dawn; Steinhardt, Jacob. Measuring Coding Challenge Competence with APPS. NeurIPS Datasets and Benchmarks, 2021.
- [4] Grounding Machine Creativity in Game Design Knowledge Representations: Empirical Probing of LLM-Based Executable Synthesis of Goal Playable Patterns under Structural Constraints. arXiv:2603.07101.
- [5] Björk, Staffan; Holopainen, Jussi. Patterns in Game Design.
- [6] Goal Playable Concepts: Coupling Gameplay Design Patterns with Playable Concepts. Proceedings of the 26th International Academic Mindtrek Conference.
- [7] Feng, Weixi; Zhu, Wanrong; Fu, Tsu-Jui; Jampani, Varun; Akula, Arjun; He, Xuehai; Basu, Sugato; Wang, Xin Eric; Wang, William Yang.
- [8] Sun, Chunyi; Han, Junlin; Deng, Weijian; Wang, Xinlong; Qin, Zishan; Gould, Stephen.
- [9] Hu, Ziniu; Iscen, Ahmet; Jain, Aashi; Kipf, Thomas; Yue, Yisong; Ross, David A.; Schmid, Cordelia; Fathi, Alireza.
- [10] Avetisyan, Armen; Xie, Christopher; Howard-Jenkins, Henry; Yang, Tsun-Yi; Aroudj, Samir; Patra, Suvam; Zhang, Fuyang; Frost, Duncan; Holland, Luke; Orme, Campbell; et al.
- [11] Lester, Brian; Al-Rfou, Rami; Constant, Noah. The Power of Scale for Parameter-Efficient Prompt Tuning. EMNLP 2021.
- [12] Li, Xiang Lisa; Liang, Percy. Prefix-Tuning: Optimizing Continuous Prompts for Generation. ACL 2021.
- [13] Mialon, Grégoire et al. Augmented Language Models: a Survey. Transactions on Machine Learning Research, 2023.
- [14] Parameter-Efficient Fine-Tuning for Large Models: A Comprehensive Survey. Transactions on Machine Learning Research.
- [15] Geng, Saibo; Josifoski, Martin; Peyrard, Maxime; West, Robert. Grammar-Constrained Decoding for Structured NLP Tasks. EMNLP 2023.
- [16] Scholak, Torsten; Schucher, Nathan; Bahdanau, Dzmitry. PICARD: Parsing Incrementally for Constrained Auto-Regressive Decoding from Language Models. EMNLP 2021.
- [17] Willard, Brandon T.; Louf, Rémi. Efficient Guided Generation for Large Language Models. arXiv:2307.09702, 2023.
- [18] Nystrom, Robert. Game Programming Patterns. 2014.
- [19] Gregory, Jason. Game Engine Architecture. 2018.
- [20] Qwen Team. Qwen3 Technical Report. 2025.
- [21] DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence.
- [22] Qwen2.5-Coder Technical Report. arXiv:2409.12186, 2024.
- [23] Yang, An et al. Qwen2 Technical Report. arXiv:2407.10671, 2024.
- [24] Dietterich, Thomas G. Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms. Neural Computation, 1998.
- [25] McNemar, Quinn. Note on the Sampling Error of the Difference Between Correlated Proportions or Percentages. Psychometrika, 1947.
- [26] Gebru, Timnit et al. Datasheets for Datasets. Communications of the ACM, 2021.
- [27] MarioGPT: Open-Ended Text2Level Generation Through Large Language Models. Advances in Neural Information Processing Systems.
- [28] Level Generation Through Large Language Models. Proceedings of the 18th International Conference on the Foundations of Digital Games.
- [29] Shaker, Noor; Togelius, Julian; Nelson, Mark J. Procedural Content Generation in Games. 2016.