Diamonds in the rough: Transforming SPARCs of imagination into a game concept by leveraging medium sized LLMs

Daniel Dyrda; Farhan Abid Ivan; Georg Groh; Julian Geheeb; Miriam Anschütz

arxiv: 2509.24730 · v1 · pith:EFW5RYG2new · submitted 2025-09-29 · 💻 cs.HC

Julian Geheeb, Farhan Abid Ivan, Daniel Dyrda, Miriam Anschütz, Georg Groh

Pith reviewed 2026-05-21 21:56 UTC · model grok-4.3

classification 💻 cs.HC

keywords: game design · large language models · medium-sized LLMs · game concepts · LLM feedback · DeepSeek-R1 · pilot study · creative workflows

The pith

Medium-sized LLMs like DeepSeek-R1 can provide useful feedback on game concepts using ten key aspects.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines how medium-sized LLMs that run on consumer hardware can support the early stages of game design. The authors defined ten aspects of a strong game concept, generated thirty sample ideas with ChatGPT, and had three models critique them. DeepSeek-R1 gave the most consistent and helpful responses. In a pilot study with ten students, most participants rated the feedback highly and expressed interest in using it in their projects. This approach could help small teams or individuals iterate on ideas without expensive resources.

Core claim

The central finding is that prompting medium-sized LLMs to assess game ideas according to ten key aspects yields valuable refinements, with DeepSeek-R1 performing best, as validated by researcher comparison and positive reception in a student pilot study.

What carries the argument

The ten key aspects of a strong game concept, which the authors use as structured criteria for the LLMs to evaluate and improve upon sample ideas.

Load-bearing premise

The ten key aspects identified by the authors are a sufficient and appropriate basis for judging the quality of game concepts.

What would settle it

If expert game designers rate concepts improved by the LLM feedback as no better than the original ideas or human-only revisions in a blind test, this would challenge the utility of the approach.
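One way such a blind comparison could be analyzed is with an exact sign test on paired expert ratings. The sketch below is illustrative only: the function name and the example ratings are hypothetical, and the paper does not report running this or any other significance test.

```python
from math import comb


def sign_test_p_value(original, revised):
    """Two-sided exact sign test on paired ratings.

    `original[i]` and `revised[i]` are an expert's ratings of the same
    concept before and after revision. Ties are dropped, as is standard
    for the sign test.
    """
    if len(original) != len(revised):
        raise ValueError("rating lists must have the same length")
    wins = sum(r > o for o, r in zip(original, revised))
    losses = sum(r < o for o, r in zip(original, revised))
    n = wins + losses
    if n == 0:
        return 1.0  # every pair tied: no evidence either way
    k = min(wins, losses)
    # Probability of a split at least this lopsided under a fair coin.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical ratings on a 1-5 scale, invented for illustration.
before = [3, 3, 3, 3, 3, 3]
after = [4, 4, 4, 4, 4, 4]
p = sign_test_p_value(before, after)
```

A large p-value on real expert ratings would be the outcome that challenges the approach; a small one would only show a difference in ratings, not why it arose.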

Figures

Figures reproduced from arXiv: 2509.24730 by Daniel Dyrda, Farhan Abid Ivan, Georg Groh, Julian Geheeb, Miriam Anschütz.

**Figure 1.** Depiction of the prompts and workflow used to generate the test dataset. The left prompt has additional options (1) and (2) used to refine the process. Adjacent text (Section 4.1.1, Game Idea Dataset Creation): "To evaluate the capabilities of different language models and enable consistent comparisons, we first created a dataset of game ideas with varying levels of descriptive detail. We used OpenAI's GPT-4o [14], accessed through…"

**Figure 2.** Prompt used to generate structured evaluation outputs for all models. The placeholder <Details on the aspects> corresponds to content adapted from section 2, which has been omitted here for brevity. Full details are available upon request. Adjacent text: "…ideas from subsubsection 4.1.1, we employed the Hugging Face Text Generation Inference Docker. This environment streamlined inference across various open-source LLMs, …"
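To make the setup in Figure 2 concrete, here is a minimal sketch of filling the aspect placeholder into a prompt and sending it to a Text Generation Inference server through its `/generate` endpoint. The template wording, the function names, and the server address are assumptions for illustration; the paper's actual prompt text and deployment settings are not reproduced on this page.

```python
import json
import urllib.request

# Hypothetical template. Only the placeholder name comes from Figure 2.
PROMPT_TEMPLATE = (
    "You are reviewing an early game concept.\n"
    "Assess it against each of the following aspects:\n"
    "<Details on the aspects>\n\n"
    "Game concept:\n{idea}\n"
)


def build_prompt(idea: str, aspect_details: str) -> str:
    """Substitute the aspect descriptions and the game idea into the template."""
    prompt = PROMPT_TEMPLATE.replace("<Details on the aspects>", aspect_details)
    return prompt.replace("{idea}", idea)


def build_tgi_request(prompt: str, max_new_tokens: int = 512) -> dict:
    """Build the JSON body expected by TGI's /generate endpoint."""
    return {"inputs": prompt, "parameters": {"max_new_tokens": max_new_tokens}}


def query_tgi(base_url: str, prompt: str) -> str:
    """POST the prompt to a running TGI server and return the generated text.

    `base_url` is an assumption, e.g. "http://localhost:8080" for a local
    Docker container; adjust it to the actual deployment.
    """
    body = json.dumps(build_tgi_request(prompt)).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["generated_text"]
```

Keeping prompt construction separate from the network call means the same prompt can be sent unchanged to each model under comparison.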

**Figure 3.** Screenshot of the SPARC frontend shown to participants in the study. After a text file is uploaded, the response appears at the bottom once processing is complete. Adjacent text: "The user study took place during the early phase of the course, when teams had just developed their initial game concepts but had not yet started implementation. Participation was voluntary. In return for their time, participants received format…"

**Figure 4.** Answer distributions for the pilot study questions. Each question is shown in the caption of the corresponding subfigure. Section 5.3 (Results) of the paper refers to this figure for the results of the closed-ended questions.

Original abstract

Recent research has demonstrated that large language models (LLMs) can support experts across various domains, including game design. In this study, we examine the utility of medium-sized LLMs, models that operate on consumer-grade hardware typically available in small studios or home environments. We began by identifying ten key aspects that contribute to a strong game concept and used ChatGPT to generate thirty sample game ideas. Three medium-sized LLMs, LLaMA 3.1, Qwen 2.5, and DeepSeek-R1, were then prompted to evaluate these ideas according to the previously identified aspects. A qualitative assessment by two researchers compared the models' outputs, revealing that DeepSeek-R1 produced the most consistently useful feedback, despite some variability in quality. To explore real-world applicability, we ran a pilot study with ten students enrolled in a storytelling course for game development. At the early stages of their own projects, students used our prompt and DeepSeek-R1 to refine their game concepts. The results indicate a positive reception: most participants rated the output as high quality and expressed interest in using such tools in their workflows. These findings suggest that current medium-sized LLMs can provide valuable feedback in early game design, though further refinement of prompting methods could improve consistency and overall effectiveness.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Medium LLMs give plausible feedback on game concepts in a small student pilot, but subjective self-ratings without controls leave the real utility unproven.

The main point is that medium-sized models like DeepSeek-R1 can produce feedback on early game ideas that students find helpful, yet the pilot does not show whether that feedback actually strengthens the concepts or just feels new.

The authors define ten aspects of a solid game concept, generate thirty examples with ChatGPT, and have three models rate them. Their qualitative check finds DeepSeek-R1 the most consistent. Ten students then used the prompt on their own projects and mostly rated the output high quality while saying they would try similar tools again. This setup targets consumer hardware, which fits small studios or classes. The student test adds a practical data point that prior LLM work on games often skips. The positive reception is a fair observation for an early exploration.

The weaknesses sit in the evaluation design. The student ratings are self-reported interest and quality scores with no expert pre/post comparison, no control condition, and no measure of actual concept improvement. Ten participants is a thin base for claims about workflows, and the ten aspects come from the authors without external validation or reliability checks. The model comparison also stays qualitative with little detail on how the two researchers resolved differences. These gaps make the utility claim reasonable but not yet solid.

This paper suits readers in game design education, HCI, or applied AI for creativity who want a grounded example of medium models in a real task. It shows honest engagement with a concrete setting and reports participant input, so it clears the bar for serious refereeing even if the evidence needs more structure.

I would send it to peer review. The core application is sensible and the pilot gives something to build on, but reviewers should press for objective measures and clearer validation of the evaluation criteria.

Referee Report

2 major / 2 minor

Summary. The manuscript examines the utility of medium-sized LLMs (LLaMA 3.1, Qwen 2.5, DeepSeek-R1) for evaluating and refining early-stage game concepts. The authors first identify ten key aspects of strong game concepts, generate 30 sample ideas via ChatGPT, qualitatively compare the models' feedback on those ideas (finding DeepSeek-R1 most consistent), and then conduct a pilot study in which 10 students from a game-development storytelling course used DeepSeek-R1 to refine their own concepts, reporting high-quality outputs and interest in workflow integration.

Significance. If the central claim holds under stronger evaluation, the work would show that consumer-grade LLMs can deliver practical early-stage feedback for game ideation in small studios or educational settings, lowering barriers to AI-assisted design without requiring large-scale infrastructure.

major comments (2)

[Pilot study] Pilot study description: the evaluation rests on subjective self-ratings of 'high quality' and workflow interest from only ten participants, with no control arm, no pre/post expert ratings of concept strength, no objective metrics, and no definition of 'high quality' beyond the unvalidated ten aspects. This leaves open alternative explanations such as novelty bias or demand characteristics and does not directly support the claim of demonstrable value.
[Qualitative assessment] Qualitative comparison of model outputs: the determination that DeepSeek-R1 produced the 'most consistently useful feedback' is based on assessment by two researchers without reported inter-rater reliability, explicit scoring rubric, or quantitative summary of variability across the 30 ideas.
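For readers unfamiliar with the inter-rater reliability the second comment asks for, Cohen's kappa is one common statistic for two raters assigning categorical labels. The sketch below assumes each researcher labels every model output with one category; the function name and the example labels are hypothetical and do not come from the paper.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    rate and p_e the agreement expected by chance from each rater's
    label frequencies.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a = Counter(rater_a)
    freq_b = Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label for every item
    return (observed - expected) / (1 - expected)


# Hypothetical labels for four outputs, invented for illustration.
labels_a = ["useful", "useful", "not", "not"]
labels_b = ["useful", "not", "not", "not"]
kappa = cohens_kappa(labels_a, labels_b)  # 0.5
```

In this toy case the raters agree on three of four items (0.75), chance agreement is 0.5, and kappa is therefore 0.5.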

minor comments (2)

[Introduction] The ten key aspects are introduced without justification for their completeness or derivation; a brief rationale or reference to prior game-design literature would strengthen the foundation.
[Methods] The manuscript would benefit from explicit discussion of prompting strategies and any observed inconsistencies in model outputs to aid reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below with honest clarifications regarding the pilot study's exploratory scope and the qualitative assessment process. Revisions have been made to enhance transparency and explicitly discuss limitations where appropriate.

Referee: [Pilot study] Pilot study description: the evaluation rests on subjective self-ratings of 'high quality' and workflow interest from only ten participants, with no control arm, no pre/post expert ratings of concept strength, no objective metrics, and no definition of 'high quality' beyond the unvalidated ten aspects. This leaves open alternative explanations such as novelty bias or demand characteristics and does not directly support the claim of demonstrable value.

Authors: We agree that the pilot study relies on subjective self-reports from a small sample of ten students and lacks a control condition, pre/post expert ratings, or objective metrics. As an exploratory pilot focused on feasibility in an educational game-development context, its goal was to gather initial impressions of workflow integration rather than establish causal value. In the revised manuscript we have added an expanded limitations section that directly addresses potential novelty bias, demand characteristics, the absence of controls, and the fact that the ten aspects were derived from prior literature without formal validation. We also clarify that 'high quality' was operationalized relative to those aspects and participants' reported usefulness. While we cannot retroactively add a control arm or expert ratings, we outline specific directions for future controlled studies. This constitutes a partial revision focused on improved reporting.

revision: partial

Referee: [Qualitative assessment] Qualitative comparison of model outputs: the determination that DeepSeek-R1 produced the 'most consistently useful feedback' is based on assessment by two researchers without reported inter-rater reliability, explicit scoring rubric, or quantitative summary of variability across the 30 ideas.

Authors: The comparison was conducted by two authors who jointly reviewed outputs for consistency and usefulness against the ten aspects. We accept that the absence of reported inter-rater reliability, a formal rubric, and quantitative variability metrics reduces transparency. The revised manuscript now includes a dedicated methods paragraph describing the assessment criteria (specificity, actionability, and coverage of the ten aspects) and the consensus-reaching process. We have added a quantitative summary noting the proportion of the 30 ideas on which each model performed best or showed inconsistency, along with representative examples placed in an appendix. These changes strengthen the description without changing the original qualitative conclusion.

revision: yes
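The kind of quantitative summary described in this response could be as simple as a per-model share of "best" judgments. The following sketch assumes one "best model" label per game idea; the function name and the placeholder labels are hypothetical, and no actual counts from the paper are shown.

```python
from collections import Counter


def best_model_shares(best_per_idea):
    """Return the fraction of ideas on which each model was judged best."""
    if not best_per_idea:
        raise ValueError("need at least one judgment")
    counts = Counter(best_per_idea)
    total = len(best_per_idea)
    return {model: count / total for model, count in counts.items()}


# Placeholder labels for four ideas, invented for illustration.
judgments = ["model_a", "model_a", "model_b", "model_c"]
shares = best_model_shares(judgments)
# shares == {"model_a": 0.5, "model_b": 0.25, "model_c": 0.25}
```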

Circularity Check

0 steps flagged

No circularity: empirical pilot study with direct ratings

The manuscript presents an empirical user study: authors identify ten aspects, generate ideas via ChatGPT, prompt three LLMs to evaluate them, perform qualitative comparison of outputs, and run a pilot with ten students who rate the tool's output and express workflow interest. No equations, fitted parameters, predictions, or derivations appear. Results are reported directly from model outputs and participant self-ratings rather than being forced by construction or reduced to prior self-citations. The central claims rest on observable study data, not on any self-referential loop.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The study depends on domain assumptions about what makes a good game concept and standard practices for qualitative HCI evaluation; no free parameters or new invented entities are introduced.

axioms (1)

domain assumption: Ten key aspects contribute to a strong game concept.
These aspects form the basis for all LLM prompts and evaluations.

Reference graph

Works this paper leans on

26 extracted references · 26 canonical work pages · 4 internal anchors

[1] C. M. Kanode, H. M. Haddad, Software engineering challenges in game development, in: 2009 Sixth International Conference on Information Technology: New Generations, IEEE, 2009, pp. 260–265
[2] Z. A. Nazi, W. Peng, Large language models in healthcare and medical domain: A review, in: Informatics, volume 11, MDPI, 2024, p. 57
[3] X. Luo, A. Rechardt, G. Sun, K. K. Nejad, F. Yáñez, B. Yilmaz, K. Lee, A. O. Cohen, V. Borghesani, A. Pashkov, et al., Large language models surpass human experts in predicting neuroscience results, Nature human behaviour 9 (2025) 305–315
[4] P. L. Lanzi, D. Loiacono, Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design, in: Proceedings of the Genetic and Evolutionary Computation Conference, 2023, pp. 1383–1390
[5] K. S. Tekinbas, E. Zimmerman, Rules of play: Game design fundamentals, MIT Press, 2003
[6] J. Schell, The Art of Game Design: A book of lenses, CRC Press, 2008
[7] A. Galuzin, Preproduction Blueprint: How to Plan Game Environments and Level Designs, CreateSpace Independent Publishing Platform, 2016
[8] C. W. Totten, Level design: Processes and experiences, CRC Press, 2017
[9] T. Fullerton, Game design workshop: a playcentric approach to creating innovative games, AK Peters/CRC Press, 2024
[10] R. Yang, Level design book, 2020. URL: https://www.leveldesignbook.com/, accessed: 2025-06-30
[11] Meta AI, Meta llama 3.1: Advancing open foundation models, 2025. URL: https://ai.meta.com/blog/meta-llama-3-1/, accessed: 2025-06-30
[12] Q. Team, Qwen2 technical report, arXiv preprint arXiv:2407.10671 (2024)
[13] D. Guo, D. Yang, H. Zhang, J. Song, R. Zhang, R. Xu, Q. Zhu, S. Ma, P. Wang, X. Bi, et al., Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning, arXiv preprint arXiv:2501.12948 (2025)
[14] A. Hurst, A. Lerer, A. P. Goucher, A. Perelman, A. Ramesh, A. Clark, A. Ostrow, A. Welihinda, A. Hayes, A. Radford, et al., Gpt-4o system card, arXiv preprint arXiv:2410.21276 (2024)
[15] J. Xu, Z. Li, W. Chen, Q. Wang, X. Gao, Q. Cai, Z. Ling, On-device language models: A comprehensive review, arXiv preprint arXiv:2409.00088 (2024)
[16] Android Developers, Gemini nano | android developers, 2024. URL: https://developer.android.com/ai/gemini-nano#gboard-smart, accessed: 2025-06-30
[17] Y. Labrak, A. Bazoge, E. Morin, P.-A. Gourraud, M. Rouvier, R. Dufour, Biomistral: A collection of open-source pretrained large language models for medical domains, arXiv preprint arXiv:2402.10373 (2024)
[18] J. Gu, X. Jiang, Z. Shi, H. Tan, X. Zhai, C. Xu, W. Li, Y. Shen, S. Ma, H. Liu, et al., A survey on llm-as-a-judge, arXiv preprint arXiv:2411.15594 (2024)
[19] D. Li, B. Jiang, L. Huang, A. Beigi, C. Zhao, Z. Tan, A. Bhattacharjee, Y. Jiang, C. Chen, T. Wu, et al., From generation to judgment: Opportunities and challenges of llm-as-a-judge, arXiv preprint arXiv:2411.16594 (2024)
[20] T. Tucek, K. Harshina, G. Samaritaki, D. Rajesh, One spell fits all: A generative ai game as a tool for research in ai creativity and sustainable design (2024)
[21] J. Hutson, B. Fulcher, J. Ratican, Enhancing assessment and feedback in game design programs: Leveraging generative ai for efficient and meaningful evaluation, International Journal of Educational Research and Innovation (2024)
[22] R. Gallotta, G. Todd, M. Zammit, S. Earle, A. Liapis, J. Togelius, G. N. Yannakakis, Large language models and games: A survey and roadmap, IEEE Transactions on Games (2024)
[23] P. Sweetser, Large language models and video games: A preliminary scoping review, in: Proceedings of the 6th ACM Conference on Conversational User Interfaces, 2024, pp. 1–8
[24] A. Begemann, J. Hutson, Empirical insights into ai-assisted game development: A case study on the integration of generative ai tools in creative pipelines, Metaverse 5 (2024)
[25] L. Long, C. Xinyi, W. Ruoyu, L. Toby Jia-Jun, L. Ray, Sketchar: Supporting character design and illustration prototyping using generative ai, Proceedings of the ACM on Human-Computer Interaction 8 (2024) 337
[26] J. Lee, S.-Y. Eom, J. Lee, Empowering game designers with generative ai, IADIS International Journal on Computer Science & Information Systems 18 (2023) 213–230
