An Exploratory Case Study of LLM-Assisted Refactoring and Gameplay Feature Generation in an Endless Runner Game

Jan Wunderlich; Markus Kleffmann; Sebastian Lempert

arxiv: 2606.21171 · v1 · pith:IAFSHL5Onew · submitted 2026-06-19 · 💻 cs.SE · cs.AI

Pith reviewed 2026-06-26 13:56 UTC · model grok-4.3

keywords: LLM-assisted refactoring · gameplay feature generation · endless runner · case study · GPT-4o · Python Pygame · software integration · exploratory study

The pith

GPT-4o completed all three localized refactoring tasks in an endless runner but only one of three new gameplay features integrated correctly.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper reports results from an exploratory case study that applied GPT-4o to six tasks inside a custom Python and Pygame endless runner game. Three tasks required localized refactoring of existing code and all produced functionally working results. The remaining three tasks asked the model to generate new gameplay features that interact with multiple existing systems, and only one produced a correctly integrated outcome. A reader would care because the work isolates a concrete difference in reliability between simple, contained code edits and changes that must coordinate across an existing game codebase, giving early evidence on where LLM assistance may reduce effort and where it still demands substantial human correction.

Core claim

In this case study, all three selected refactoring tasks were completed successfully in functional terms, whereas only one of the three selected gameplay feature generation tasks resulted in a correctly integrated feature. The findings suggest that, in this setting, GPT-4o handled localized transformations more reliably than tasks requiring new gameplay interactions across multiple existing systems. Given the exploratory single-case design, these results are best interpreted as indicative observations rather than as generalizable evidence of category-level model performance.
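
To make the size of this evidence concrete, the reported counts can be tabulated and run through a small exact test. The sketch below is not an analysis the authors report. It uses only the 3/3 and 1/3 outcomes stated above, and `fisher_exact_two_sided` is an illustrative helper written for this note.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    total = row1 + row2

    def prob(x: int) -> float:
        # Hypergeometric probability of x successes in the first row.
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed one.
    return sum(
        p for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= observed + 1e-12
    )


# Outcomes as reported: (successes, failures) per task category.
refactoring = (3, 0)
feature_generation = (1, 2)

refactoring_rate = refactoring[0] / sum(refactoring)
feature_rate = feature_generation[0] / sum(feature_generation)
p_value = fisher_exact_two_sided(*refactoring, *feature_generation)

print(f"refactoring success rate: {refactoring_rate:.2f}")  # 1.00
print(f"feature success rate:     {feature_rate:.2f}")      # 0.33
print(f"two-sided Fisher exact p: {p_value:.2f}")           # 0.40
```

With six tasks in total, a p-value of 0.40 is far from any conventional threshold, which is consistent with reading the 3/3 versus 1/3 split as an indicative observation rather than a category-level result.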

What carries the argument

The six selected development tasks (three localized refactoring tasks and three gameplay feature generation tasks) performed inside one custom Python/Pygame endless runner and evaluated with software metrics, unit tests, and manual gameplay assessment.

If this is right

Localized refactoring tasks can be completed in functional form by the model.
Gameplay feature tasks that span multiple existing systems frequently fail to produce correctly integrated code.
Opportunities for LLM assistance appear greater in maintenance-style work than in expansion work that touches several systems.
The single-case exploratory design limits claims to observations about this particular game and task set.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same reliability gap may appear when the same model is applied to feature additions in other game engines or larger codebases.
Teams could route only small, self-contained refactors to the model while routing cross-system changes to human developers.
A follow-up study that systematically varies the number of systems touched by each feature task could test whether interaction count predicts success rate.
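
The routing idea in the list above could be sketched as a simple triage rule. Everything here is hypothetical: the `Task` model, the `route` function, the one-system threshold, and the example task names are assumptions made for illustration and do not come from the paper. The three system names are the ones mentioned in the simulated authors' rebuttal.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    """A development task and the game systems it would modify (hypothetical model)."""
    name: str
    systems_touched: frozenset = frozenset()
    adds_new_interaction: bool = False


def route(task: Task, max_systems_for_llm: int = 1) -> str:
    """Return 'llm' for small, self-contained edits and 'human' otherwise.

    The threshold is an assumed starting point, not a value from the study.
    """
    if task.adds_new_interaction or len(task.systems_touched) > max_systems_for_llm:
        return "human"
    return "llm"


# Hypothetical example tasks.
tasks = [
    Task("extract duplicated scoring logic", frozenset({"scoring"})),
    Task(
        "add a new obstacle mechanic",
        frozenset({"player_physics", "level_generation", "collision"}),
        adds_new_interaction=True,
    ),
]

for task in tasks:
    print(f"{task.name}: {route(task)}")
```

Recording `systems_touched` for every task would also give the proposed follow-up study its independent variable, since success could then be tabulated against the number of systems each task touches.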

Load-bearing premise

The three selected gameplay feature generation tasks accurately represent the challenges of adding new features that interact across multiple systems in game development.

What would settle it

Repeat the three feature-generation tasks in the same game after a refactor that changes how the systems communicate, then check whether the success count stays at one, rises, or falls.

Figures

Figures reproduced from arXiv: 2606.21171 by Jan Wunderlich, Markus Kleffmann, Sebastian Lempert.

**Figure 1.** Example gameplay scene from the custom Python/Pygame endless runner used as the software artifact in this study.

Original abstract

Large language models (LLMs) are increasingly used to support software development, but their practical usefulness in applied game-development settings remains underexplored, especially when generated code must be integrated into an existing game software system. This paper presents an exploratory empirical case study of GPT-4o in a custom Python/Pygame endless runner. The study examines six selected development tasks: three localized refactoring tasks and three tasks involving gameplay feature generation. The resulting implementations were evaluated using software metrics, unit tests, and manual gameplay assessments. In this case study, all three selected refactoring tasks were completed successfully in functional terms, whereas only one of the three selected gameplay feature generation tasks resulted in a correctly integrated feature. The findings suggest that, in this setting, GPT-4o handled localized transformations more reliably than tasks requiring new gameplay interactions across multiple existing systems. Given the exploratory single-case design, these results are best interpreted as indicative observations rather than as generalizable evidence of category-level model performance. Overall, the paper contributes a transparent case-based account of the opportunities and limitations of LLM-assisted refactoring and gameplay feature generation in an existing game software system.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Small single-case study showing GPT-4o succeeding on refactoring but struggling with cross-system feature additions in one Pygame game.

The main point is that in this endless runner, GPT-4o completed all three refactoring tasks in functional terms but only one of the three gameplay feature tasks, with the difference tied to localized changes versus those needing new interactions across existing systems.

The paper does a solid job of keeping the scope narrow and stating its exploratory single-case limits up front. It reports outcomes through software metrics, unit tests, and manual gameplay checks, which gives a concrete record of what happened when trying to integrate generated code into an existing Python/Pygame codebase. That kind of applied example is still relatively rare, so the account of integration friction adds a usable data point even if it stays small.

The soft spot is the representativeness of the three feature tasks. The abstract gives no descriptions, interaction details, or code examples, so it is difficult to confirm that the failures came from multi-system demands rather than task selection, prompt wording, or evaluation choices. With only one game and six tasks total, the localized-versus-cross-system distinction remains suggestive rather than firmly established.

This is for readers who track practical LLM use in game development or similar domains and want real integration examples rather than broad claims. It will not shift general understanding of model capabilities, but it supplies one transparent instance of where the tools hit limits.

I would send it for peer review. The reporting is direct, the caveats are appropriate, and more documented cases like this can accumulate into something more useful even if each one stays limited.

Referee Report

2 major / 0 minor

Summary. The paper presents an exploratory single-case study of GPT-4o assisting with six development tasks in a custom Python/Pygame endless runner game: three localized refactoring tasks (all reported as functionally successful) and three gameplay feature generation tasks (only one reported as correctly integrated). The authors conclude that, in this setting, the model handled localized transformations more reliably than tasks requiring new gameplay interactions across multiple existing systems, while explicitly noting the single-case design renders the results indicative rather than generalizable.

Significance. If the task classifications and outcomes hold, the study supplies a transparent, concrete account of LLM integration challenges in an existing game codebase, distinguishing localized edits from cross-system feature additions. This is a modest but useful contribution to the empirical literature on LLM-assisted software engineering in applied domains; the explicit caveats about generalizability and the use of multiple evaluation methods (metrics, tests, manual assessment) strengthen its value as a starting point for targeted follow-up work.

major comments (2)

[Abstract] Abstract: The central comparative claim (3/3 refactoring success vs. 1/3 feature-generation success) attributes the difference to 'tasks requiring new gameplay interactions across multiple existing systems.' No task descriptions, selection criteria, interaction diagrams, or code diffs are supplied, so it is impossible to verify that the three feature tasks actually exercised cross-system interactions or that failures stemmed from that property rather than prompt formulation, task difficulty, or evaluation criteria. This is load-bearing for the strongest claim.
[Abstract] Abstract (evaluation paragraph): The paper states that implementations were evaluated with 'software metrics, unit tests, and manual gameplay assessments,' yet provides no concrete metrics, pass/fail thresholds, or examples of how 'correctly integrated' was operationalized for the feature tasks. Without these details the reported success rates cannot be independently assessed.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our exploratory case study. The comments correctly identify that the abstract would be strengthened by additional details on task selection and evaluation criteria to support the reported outcomes. We will revise the abstract in the next version to address both points while preserving the paper's focus on indicative, single-case observations.

Referee: [Abstract] Abstract: The central comparative claim (3/3 refactoring success vs. 1/3 feature-generation success) attributes the difference to 'tasks requiring new gameplay interactions across multiple existing systems.' No task descriptions, selection criteria, interaction diagrams, or code diffs are supplied, so it is impossible to verify that the three feature tasks actually exercised cross-system interactions or that failures stemmed from that property rather than prompt formulation, task difficulty, or evaluation criteria. This is load-bearing for the strongest claim.

Authors: We agree the abstract should supply more context. The full manuscript (Section 3) describes the six tasks, their selection (chosen as representative of common game-dev activities: localized refactors vs. new mechanics), and notes that the unsuccessful feature tasks required edits spanning player physics, level generation, and collision systems. The successful feature was more self-contained. No interaction diagrams or diffs appear because the study emphasizes outcomes over implementation artifacts, but the distinction is drawn from direct inspection of generated code and integration failures. We will add a concise sentence to the abstract summarizing task categories and the cross-system scope of the feature tasks. revision: yes
Referee: [Abstract] Abstract (evaluation paragraph): The paper states that implementations were evaluated with 'software metrics, unit tests, and manual gameplay assessments,' yet provides no concrete metrics, pass/fail thresholds, or examples of how 'correctly integrated' was operationalized for the feature tasks. Without these details the reported success rates cannot be independently assessed.

Authors: We accept that the abstract is too terse on evaluation. Section 4 of the manuscript specifies the criteria: refactoring success required passing unit tests plus non-regression on cyclomatic complexity and maintainability index; feature success required the new mechanic to execute without runtime errors or breakage of existing systems across 10 manual play sessions. We will revise the abstract to briefly state these operationalizations (e.g., 'success defined as functional integration verified by unit tests and manual assessment showing no defects in core loops'). revision: yes
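
The criteria stated in this response could be written down as explicit checks, roughly as follows. This is a minimal sketch under stated assumptions: the class and function names are hypothetical, "non-regression" is read as complexity not increasing and maintainability index not decreasing, and the example values are placeholders rather than measurements from the paper.

```python
from dataclasses import dataclass


@dataclass
class RefactoringResult:
    unit_tests_passed: bool
    complexity_before: int          # cyclomatic complexity before the refactor
    complexity_after: int           # cyclomatic complexity after the refactor
    maintainability_before: float   # maintainability index before the refactor
    maintainability_after: float    # maintainability index after the refactor


@dataclass
class FeatureResult:
    # One entry per manual play session: True if the session showed no
    # runtime error and no breakage of existing systems.
    sessions_clean: list[bool]


def refactoring_succeeded(r: RefactoringResult) -> bool:
    """Tests pass and neither metric regresses (assumed reading of non-regression)."""
    return (
        r.unit_tests_passed
        and r.complexity_after <= r.complexity_before
        and r.maintainability_after >= r.maintainability_before
    )


def feature_succeeded(f: FeatureResult, required_sessions: int = 10) -> bool:
    """Every one of at least `required_sessions` play sessions must be clean."""
    return len(f.sessions_clean) >= required_sessions and all(f.sessions_clean)


# Placeholder values for illustration only; not data from the study.
example_refactor = RefactoringResult(True, 12, 9, 61.0, 68.5)
example_feature = FeatureResult(sessions_clean=[True] * 9 + [False])

print(refactoring_succeeded(example_refactor))  # True
print(feature_succeeded(example_feature))       # False
```

Stating the thresholds in this form would let a reader see exactly which condition a failed feature task missed.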

Circularity Check

0 steps flagged

No circularity: direct empirical case study with no derivations or fitted parameters

The paper is an exploratory single-case empirical report on six concrete development tasks performed with GPT-4o in one Pygame codebase. It contains no equations, no parameter fitting, no predictions derived from models, and no self-citation chains that reduce any claim to its own inputs by construction. All observations (success on refactoring tasks vs. partial success on feature tasks) are presented as direct outcomes of the described experiments, explicitly qualified as non-generalizable indicative findings. None of the six enumerated circularity patterns apply; the central claim rests on the reported task outcomes rather than on any definitional or fitted reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities as this is an empirical case study without theoretical modeling.

pith-pipeline@v0.9.1-grok · 5734 in / 914 out tokens · 26400 ms · 2026-06-26T13:56:08.553792+00:00 · methodology
