Playing with Words, Improving with Rewards: Training Language Models for Creative Association
Pith reviewed 2026-06-29 13:42 UTC · model grok-4.3
The pith
Training LLMs on Codenames with verifiable rewards improves creativity in 8B models and reasoning in smaller ones.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We train Qwen3-1.7B, 4B, and 8B models on Codenames using RLVR and find that the precision-diversity trade-off is scale-dependent: the 8B model prioritizes creativity over precision, while the 1.7B and 4B models gain reasoning precision at the cost of creativity. Concretely, the 8B model shows modest but consistent creativity gains (8 of 10 benchmarks) with only minor reasoning degradation, whereas the smaller models achieve substantial gains on reasoning tasks.
What carries the argument
Reinforcement Learning with Verifiable Rewards (RLVR) applied to the Codenames game, whose objective outcomes allow training on divergent and convergent thinking without subjective scoring.
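To make "verifiable" concrete, here is a minimal sketch of how a Codenames round could be scored without a human judge. It is not the paper's reward: the function name `clue_reward`, the scoring rule (fraction of targets recovered, zero if any non-target word is guessed), and the example words are all assumptions chosen for illustration.

```python
def clue_reward(guesses, targets, non_targets):
    """Hypothetical verifiable reward for one Codenames round.

    Returns the fraction of target words the guesser recovered, or 0.0
    if any guess lands on a non-target word. This rule is an assumption,
    not the reward used in the paper.
    """
    target_set = {w.upper() for w in targets}
    forbidden = {w.upper() for w in non_targets}
    if not target_set:
        return 0.0
    found = set()
    for guess in guesses:
        word = guess.upper()
        if word in forbidden:
            return 0.0  # guessing a non-target word forfeits the round
        if word not in target_set:
            break  # word is not on the board: stop, keep earlier hits
        found.add(word)  # a set, so repeated guesses count only once
    return len(found) / len(target_set)


# Illustrative board (made-up words, not taken from the training data)
print(clue_reward(["LENS", "REFRACT"], ["REFRACT", "LENS"], ["GARDEN"]))  # 1.0
print(clue_reward(["LENS", "GARDEN"], ["REFRACT", "LENS"], ["GARDEN"]))   # 0.0
```

The point of the sketch is only that the score depends on set membership, which a program can check exactly. A broader clue that links more targets earns more, while a clue that is too loose risks a non-target hit, which is one way the game could reward both divergent and convergent thinking.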
If this is right
- The 8B model improves on eight of ten creativity benchmarks with only minor reasoning degradation.
- The 1.7B and 4B models obtain substantial gains on reasoning tasks.
- RLVR on Codenames supplies a scalable method that bypasses human judgment for creativity training.
- The precision-diversity trade-off varies with model scale rather than remaining constant.
Where Pith is reading between the lines
- Verifiable games with similar structure could be used to target other cognitive skills beyond creativity.
- Training curricula might be chosen according to model size to emphasize either creativity or reasoning.
- Direct tests on open-ended creative tasks outside the benchmark set would clarify whether the reported gains generalize.
Load-bearing premise
That gains achieved on Codenames during RLVR training transfer to measurable improvements on ten separate creativity benchmarks that validly capture divergent and convergent thinking.
What would settle it
An experiment in which models trained on Codenames show no gains (or net losses) across the ten creativity benchmarks, or evidence that those benchmarks fail to measure the thinking axes exercised by the game.
Original abstract
Large Language Models (LLMs) are being applied to increasingly difficult problems and use cases. To navigate their vast solution spaces effectively, LLMs need to be creative. Yet the subjective nature of creativity and the limits of human judgment make training LLMs for creativity especially challenging. As a solution, we train LLMs on Codenames, a word-association game that exercises the two central axes of creativity, divergent and convergent thinking, while yielding objectively verifiable outcomes. This verifiability lets us bypass human judgment and train with Reinforcement Learning with Verifiable Rewards (RLVR). We train Qwen3-1.7B, 4B, and 8B models and evaluate them on ten creativity and four reasoning benchmarks. We find that the precision-diversity trade-off is scale-dependent: the 8B model prioritizes creativity over precision, while the 1.7B and 4B models gain reasoning precision at the cost of creativity. Concretely, the 8B model shows modest but consistent creativity gains (8 of 10 benchmarks) with only minor reasoning degradation, whereas the smaller models achieve substantial gains on reasoning tasks. Our study presents a scalable and effective solution to train LLMs for creativity.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that training Qwen3 models (1.7B, 4B, 8B) on the Codenames word-association game via Reinforcement Learning with Verifiable Rewards (RLVR) produces transferable gains on ten separate creativity benchmarks while exercising divergent and convergent thinking, with a scale-dependent precision-diversity trade-off: the 8B model improves on 8/10 creativity benchmarks with only minor reasoning degradation, whereas the smaller models show substantial reasoning gains at some cost to creativity.
Significance. If the empirical patterns hold after addressing the points below, the work supplies a concrete, objective alternative to subjective human judgment for creativity training by leveraging verifiable game outcomes. The multi-size evaluation and separation of training game from test benchmarks are strengths that allow direct falsification of the transfer claim. No machine-checked proofs or open code are mentioned, but the design targets reproducible benchmark scores.
major comments (3)
- [§4] §4 (Results): The claim of 'modest but consistent creativity gains (8 of 10 benchmarks)' for the 8B model is load-bearing for the central scale-dependent finding, yet no statistical significance tests, confidence intervals, or per-benchmark effect sizes are reported; without these it is impossible to distinguish signal from noise or to verify consistency across the ten benchmarks.
- [§3, §4.1] §3 (Methods) and §4.1 (Benchmark selection): The assertion that Codenames success transfers to measurable gains on the ten creativity benchmarks rests on the untested premise that those benchmarks validly operationalize the divergent/convergent axes exercised by the game; no ablation, correlation analysis, or validation against established creativity instruments is provided, directly undermining the interpretation of the reported transfer.
- [§4.2] §4.2 (Reasoning benchmarks): The smaller models are said to achieve 'substantial gains on reasoning tasks' while the 8B shows 'only minor reasoning degradation,' but the paper supplies neither the raw scores, baseline comparisons, nor error analysis needed to quantify the claimed trade-off or to rule out that the patterns are artifacts of the particular four reasoning benchmarks chosen.
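As a rough illustration of why the first comment matters, the sketch below runs a one-sided sign test on the "8 of 10" count alone. It assumes the ten benchmarks are independent, that none of them tied, and that under the null hypothesis each is equally likely to go up or down. These are simplifying assumptions of this sketch, not an analysis from the paper, and the helper name `sign_test_p_value` is hypothetical.

```python
from math import comb


def sign_test_p_value(wins, n):
    """One-sided P(X >= wins) for X ~ Binomial(n, 0.5)."""
    return sum(comb(n, k) for k in range(wins, n + 1)) / 2 ** n


# 8 improvements out of 10 benchmarks, direction only, magnitudes ignored
p = sign_test_p_value(8, 10)
print(f"{p:.4f}")  # 0.0547
```

Under these assumptions the count by itself sits just above the conventional 0.05 threshold, so per-benchmark effect sizes and seed-level variance would be needed to decide whether the gains are signal or noise.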
minor comments (3)
- [§2] The abstract and introduction use 'precision-diversity trade-off' without an explicit definition or equation linking precision to the reasoning benchmarks and diversity to the creativity ones; this notation should be formalized in §2.
- [Figures/Tables] Table captions and axis labels in the benchmark result figures should include the exact metric (e.g., accuracy, F1) and whether higher or lower is better, to avoid ambiguity when comparing creativity versus reasoning columns.
- [§1] A short related-work subsection contrasting RLVR on Codenames with prior game-based or reward-model approaches to creativity would help situate the contribution.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive comments. We address each major point below, committing to revisions that strengthen the statistical reporting and transparency while defending the benchmark selection on theoretical grounds from prior literature. All requested details can be incorporated in a revised version.
- Referee: [§4] §4 (Results): The claim of 'modest but consistent creativity gains (8 of 10 benchmarks)' for the 8B model is load-bearing for the central scale-dependent finding, yet no statistical significance tests, confidence intervals, or per-benchmark effect sizes are reported; without these it is impossible to distinguish signal from noise or to verify consistency across the ten benchmarks.
  Authors: We agree that the absence of statistical tests limits interpretability. In the revised manuscript we will add paired statistical tests (t-tests or Wilcoxon signed-rank as appropriate), 95% confidence intervals, and per-benchmark effect sizes (Cohen’s d) computed across multiple random seeds for all ten creativity benchmarks. These additions will directly address concerns about signal versus noise and consistency.
  Revision: yes
- Referee: [§3, §4.1] §3 (Methods) and §4.1 (Benchmark selection): The assertion that Codenames success transfers to measurable gains on the ten creativity benchmarks rests on the untested premise that those benchmarks validly operationalize the divergent/convergent axes exercised by the game; no ablation, correlation analysis, or validation against established creativity instruments is provided, directly undermining the interpretation of the reported transfer.
  Authors: Benchmark selection followed established mappings in the creativity literature that associate specific tasks with divergent versus convergent thinking. We will expand the methods and discussion sections to explicitly cite these mappings and add a correlation analysis between Codenames win rates and benchmark scores using the existing evaluation data. Full ablation or new human validation studies fall outside the current experimental scope and are noted as future work; the empirical transfer results remain falsifiable via the reported benchmark scores.
  Revision: partial
- Referee: [§4.2] §4.2 (Reasoning benchmarks): The smaller models are said to achieve 'substantial gains on reasoning tasks' while the 8B shows 'only minor reasoning degradation,' but the paper supplies neither the raw scores, baseline comparisons, nor error analysis needed to quantify the claimed trade-off or to rule out that the patterns are artifacts of the particular four reasoning benchmarks chosen.
  Authors: We will add comprehensive tables in §4.2 and the appendix containing raw pre- and post-training scores for all three model sizes on the four reasoning benchmarks, together with baseline comparisons against the untuned Qwen3 models. We will also include an error analysis (e.g., category-wise breakdowns) and a brief discussion of benchmark limitations to better quantify the observed scale-dependent trade-off.
  Revision: yes
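The statistics promised in the first response could look roughly like the sketch below, which computes a paired effect size and a bootstrap confidence interval for one benchmark across seeds. It uses only the Python standard library. The function names, the choice of the paired-differences variant of Cohen's d (often written d_z), the percentile bootstrap, and the score values are all assumptions made for illustration; the scores are placeholders, not results from the paper.

```python
import random
from statistics import mean, stdev


def paired_cohens_d(before, after):
    """Cohen's d for paired samples: mean difference / SD of differences."""
    diffs = [a - b for b, a in zip(before, after)]
    return mean(diffs) / stdev(diffs)


def bootstrap_ci(before, after, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean paired difference."""
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    diffs = [a - b for b, a in zip(before, after)]
    means = sorted(
        mean(rng.choices(diffs, k=len(diffs))) for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Placeholder scores for one benchmark over five seeds (NOT paper data)
base_scores = [0.50, 0.52, 0.49, 0.51, 0.50]
tuned_scores = [0.53, 0.54, 0.50, 0.55, 0.52]

d = paired_cohens_d(base_scores, tuned_scores)
ci_low, ci_high = bootstrap_ci(base_scores, tuned_scores)
print(f"d_z = {d:.2f}, 95% CI for mean gain = [{ci_low:.3f}, {ci_high:.3f}]")
```

With only a handful of seeds the bootstrap interval is coarse, so a paired t-test or Wilcoxon signed-rank test, as the authors propose, would normally be reported alongside it.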
Circularity Check
No significant circularity; empirical training and benchmark evaluation are self-contained
The paper reports an empirical pipeline: RLVR training of Qwen3 models (1.7B/4B/8B) on Codenames followed by evaluation on ten separate creativity benchmarks and four reasoning benchmarks. No equations, fitted parameters renamed as predictions, self-citations, or uniqueness theorems appear in the provided text. All reported gains, trade-offs, and scale-dependent patterns are presented as direct experimental outcomes rather than derivations that reduce to the inputs by construction. The design therefore contains no load-bearing steps that qualify under the enumerated circularity patterns.