pith. machine review for the scientific record.

arxiv: 2603.09970 · v2 · submitted 2026-03-10 · 💻 cs.CL

Recognition: no theorem link

CREATE: Testing LLMs for Associative Creativity

Authors on Pith · no claims yet

Pith reviewed 2026-05-15 13:02 UTC · model grok-4.3

classification 💻 cs.CL
keywords associative creativity · LLM benchmark · creative reasoning · concept paths · specificity · diversity · frontier models · hypothesis generation

The pith

Frontier LLMs achieve higher creative utility on associative reasoning tasks than weaker models, but benchmark saturation remains difficult due to vast search spaces.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces CREATE, a benchmark for testing large language models on associative creativity through the generation of multiple paths that connect concepts drawn from their parametric knowledge. Paths earn credit when they show high specificity, defined as distinctiveness and closeness of the link, along with high diversity across the set, and models score better when they produce larger collections of strong paths. The design captures core demands of open-ended creative work such as hypothesis generation while supporting objective, scalable scoring. Evaluations of current frontier models show that the strongest ones deliver measurably higher creative utility, yet the enormous number of possible answers and search complexity keep the benchmark far from saturation. The results also indicate that models built for extended thinking or equipped with creative prompting do not reliably outperform others on this task.

Core claim

CREATE is a benchmark where models must generate sets of paths connecting concepts in their parametric knowledge, with paths evaluated for high specificity (distinctiveness and closeness) and high diversity (dissimilarity), and higher scores for larger sets of strong paths. Frontier models achieve higher creative utility, but the high multiplicity of answers and search complexity make saturation difficult. Thinking models are not always more effective, and creative prompting offers limited improvement.

What carries the argument

CREATE benchmark that scores creative utility by the quantity, specificity, and diversity of generated concept-connection paths.
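The scoring scheme as described — credit for many, specific, mutually diverse paths — can be sketched as a greedy, diversity-filtered sum. This is an illustrative reconstruction, not the paper's implementation: the `min_dist` threshold, the `specificity` and `distance` callables, and the greedy ordering are all assumptions.

```python
# Greedy diversity-filtered scoring in the spirit of CREATE's description
# (our reconstruction; min_dist, the callables, and the greedy order are
# assumptions, not the paper's actual scoring rule).

def creative_utility(paths, specificity, distance, min_dist=0.4):
    """Sum specificity over paths that clear a pairwise-diversity bar.

    specificity: path -> score in [0, 1]
    distance:    (path, path) -> dissimilarity in [0, 1]
    """
    accepted, total = [], 0.0
    for p in sorted(paths, key=specificity, reverse=True):
        # Credit a path only if it is sufficiently unlike all accepted ones.
        if all(distance(p, q) >= min_dist for q in accepted):
            accepted.append(p)
            total += specificity(p)  # more strong, diverse paths -> higher score
    return total
```

Under this reading, a model gains nothing from near-duplicate paths, which matches the page's summary that larger collections of strong, distinct paths score higher.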

If this is right

  • Stronger models will continue to show measurable advantages in producing larger sets of specific and diverse associative paths.
  • Extended thinking modes and higher token budgets will not guarantee superior performance on associative creativity tasks.
  • Existing creative prompting techniques will deliver only modest gains rather than transformative improvements.
  • The scale and multiplicity of valid answers will keep the benchmark resistant to saturation even with further model scaling.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Associative creativity may function as a distinct capability from general reasoning chains or factual recall in current models.
  • CREATE could serve as a testbed for methods aimed at improving AI support for scientific hypothesis generation.
  • The benchmark's reliance on automated metrics creates an opening for hybrid human-AI validation loops to strengthen reliability.

Load-bearing premise

Automated scoring of specificity and diversity in generated paths reliably matches human judgments of creative associative reasoning.

What would settle it

A human rating study on a sample of model-generated paths that finds low correlation with the benchmark's automated specificity and diversity scores.
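If such a study were run, its headline number would be a rank correlation between the automated scores and human ratings. A minimal stdlib-only Spearman implementation (our sketch for illustration; the paper does not specify a statistic, and the tie handling here is a standard choice, not the authors'):

```python
# Illustrative stdlib-only Spearman rank correlation (our sketch; the
# study design and variable names are assumptions, not the paper's protocol).

def _ranks(xs):
    """Average ranks (1-based), with ties sharing their mean rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def spearman(auto_scores, human_scores):
    """Rank correlation between automated and human creativity ratings."""
    rx, ry = _ranks(auto_scores), _ranks(human_scores)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var = (sum((a - mx) ** 2 for a in rx) * sum((b - my) ** 2 for b in ry)) ** 0.5
    return cov / var
```

A correlation near zero on a representative sample of paths would undercut the benchmark's central premise; a strong positive correlation would support it.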

Figures

Figures reproduced from arXiv: 2603.09970 by Greg Durrett, Harvey Lederman, Junyi Jessy Li, Manya Wadhwa, Tiasa Singha Roy.

Figure 1
Figure 1. Motivating example of brainstorming paths in knowledge graphs. In CREATE, only the question is given; reasoning over the graph is implicit in the model's parameters and thinking trace, similar to drawing connections for scientific research. Finding strong, distinct paths can be challenging.
Figure 2
Figure 2. Examples of model-generated paths compared against population paths, along with quality scores and minimum distance values. The first and last connect artists through classic relations of directing, acting, performing, etc. The second path is the weakest according to the assessed specificity, because a connection through St. Louis is potentially shared by many entities.
Figure 3
Figure 3. Alternative prompting methods can lead to improvements depending on the model. Iterate and Resample interventions lead to the highest creative utility scores.
Figure 4
Figure 4. Creative utility vs. patience for the frontier models, as well as prompt variations for GPT-5-mini. Utility values are similar at lower patience, but the difference increases as patience increases.
Figure 5
Figure 5. Creative utility (patience = 0.9) of a system as factuality is included in the objective. Models trade off factuality for utility: at the most lenient setting, Gemini-3-Pro has the highest utility, while at the strictest, GPT-5 balances the two metrics better than other models.
Figure 6
Figure 6. Pre-transformation, transformation function, and post-transformation cosine distance scores for GPT-5 (medium) paths. Many pairs scored at a distance of 1 lined up with our intuitive assessment that they represented completely different relationships.
Figure 7
Figure 7. LLM-as-a-judge performance for evaluating factuality of a generated path. Class 0: precision 0.52, recall 0.94, F1 0.67 (support 72); class 1: precision 0.98, recall 0.77, F1 0.87 (support 274). The most common error sources for this factuality evaluator are, first, incorrect identification and assessment of generations involving niche, long-tail entities, which points to a knowledge gap, and second, misinterpreted relations.
Figure 8
Figure 8. Distribution of creative utility scores for Gemini-3-Pro and GPT-5. Tables 9, 10, and 11 show examples of raw model outputs for a query. Claude tends to be very conservative on our task, focusing more on verifying information and hence giving fewer connections.
Figure 9
Figure 9. Example of a reasoning chain for the query 'What are possible connections between David Koechner and someone who was born in Newport Beach?'
read the original abstract

A key component of creativity is associative reasoning: the ability to draw novel yet meaningful connections between concepts. We introduce CREATE, a benchmark designed to evaluate models' capacity for creative associative reasoning. CREATE requires models to generate sets of paths connecting concepts in a model's parametric knowledge. Paths should have high specificity (distinctiveness and closeness of the concept connection) and high diversity (dissimilarity from other paths), and models are scored more highly if they produce a larger set of strong, diverse paths. This task shares demands of real creativity tasks like hypothesis generation, including an extremely large search space, but enables collection of a sizable benchmark with objective answer grading. Evaluation of frontier models shows that the strongest models achieve higher creative utility than others, with the high multiplicity of answers and complexity of the search making benchmark saturation difficult to achieve. Furthermore, our results illustrate that thinking models are not always more effective on our task, even with high token budgets. Recent approaches for creative prompting give some but limited additional improvement. CREATE provides a sandbox for developing new methods to improve models' capacity for associative creativity.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces CREATE, a benchmark for LLMs' associative creativity that requires generating multiple paths connecting concepts drawn from parametric knowledge. Paths are scored on specificity (distinctiveness plus closeness) and diversity (dissimilarity across the set), with higher scores awarded for larger collections of strong paths. Frontier-model evaluations show that stronger models obtain higher creative-utility scores, yet the task's combinatorial scale prevents saturation; thinking models and recent creative-prompting methods yield only limited gains.

Significance. If the automated metrics prove to track human judgments of creative association, CREATE supplies a scalable, objective sandbox for a capability central to hypothesis generation and scientific reasoning. The benchmark's emphasis on multiplicity and open-ended search distinguishes it from narrower creativity tests and could usefully guide method development. The reported difficulty of saturation is a constructive signal that the task remains informative for current frontier systems.

major comments (2)
  1. [Methods (scoring rules) and Results (model comparisons)] The specificity (distinctiveness + closeness) and diversity (path dissimilarity) metrics are introduced without any human validation study, inter-rater agreement statistics, or even a small correlation table against human ratings of associative creativity. Because model rankings and the headline claim rest entirely on these proxies, the absence of validation is load-bearing for the central result.
  2. [Benchmark construction and Results] The paper states that the search space is 'extremely large' and that saturation is difficult, yet provides no quantitative characterization (e.g., effective branching factor, average path length distribution, or fraction of concept pairs that admit many high-scoring paths). Without such analysis it is hard to judge whether the observed model ordering reflects genuine creative capacity or sensitivity to the particular scoring thresholds chosen.
minor comments (2)
  1. [Abstract and §4] The term 'thinking models' is used in the abstract and results but is not defined until the experimental setup; a brief parenthetical or footnote on first use would improve readability.
  2. [Figures 2–4] Figure captions and axis labels for the creative-utility plots should explicitly state the number of paths sampled per model and the exact aggregation function used to compute the final score.
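The search-space characterization requested in major comment 2 is straightforward to compute once the underlying concept graph is in hand. A hypothetical sketch, assuming a simple adjacency-list graph (the toy structure and function names are ours, not the paper's):

```python
# Sketch of the search-space statistics a referee might ask for: mean
# out-degree (effective branching factor) and simple-path counts between
# a concept pair (our construction; not from the paper).

def effective_branching_factor(graph):
    """Mean out-degree of an adjacency-list concept graph."""
    degrees = [len(nbrs) for nbrs in graph.values()]
    return sum(degrees) / len(degrees)

def count_paths(graph, src, dst, max_len):
    """Number of simple paths from src to dst using at most max_len edges."""
    count = 0
    stack = [(src, frozenset([src]), 0)]
    while stack:
        node, seen, depth = stack.pop()
        if node == dst and depth > 0:
            count += 1  # a complete path ending at the target
            continue
        if depth == max_len:
            continue
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                stack.append((nxt, seen | {nxt}, depth + 1))
    return count
```

Reporting the distribution of `count_paths` over sampled concept pairs would directly quantify the "high multiplicity of answers" the paper invokes.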

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our CREATE benchmark. We address the major comments point by point below, agreeing where the manuscript can be strengthened and outlining specific revisions.

read point-by-point responses
  1. Referee: [Methods (scoring rules) and Results (model comparisons)] The specificity (distinctiveness + closeness) and diversity (path dissimilarity) metrics are introduced without any human validation study, inter-rater agreement statistics, or even a small correlation table against human ratings of associative creativity. Because model rankings and the headline claim rest entirely on these proxies, the absence of validation is load-bearing for the central result.

    Authors: We agree that direct validation against human judgments would strengthen the metrics' credibility. In the revised manuscript we will add a targeted human study on a subset of concept pairs and generated paths, reporting inter-rater agreement (e.g., Fleiss' kappa) and Pearson/Spearman correlations between our automated specificity and diversity scores and human ratings of associative creativity. This addition directly addresses the load-bearing concern while preserving the objective, scalable nature of the benchmark. revision: yes

  2. Referee: [Benchmark construction and Results] The paper states that the search space is 'extremely large' and that saturation is difficult, yet provides no quantitative characterization (e.g., effective branching factor, average path length distribution, or fraction of concept pairs that admit many high-scoring paths). Without such analysis it is hard to judge whether the observed model ordering reflects genuine creative capacity or sensitivity to the particular scoring thresholds chosen.

    Authors: We accept that additional quantitative characterization of the search space would improve interpretability. The revised manuscript will include an analysis section reporting: (1) effective branching factor derived from the concept graph, (2) distributions of path lengths among high-scoring paths, and (3) the fraction of sampled concept pairs that admit multiple high-specificity, high-diversity paths. These statistics will clarify that model differences arise from the task's combinatorial scale rather than arbitrary threshold choices. revision: yes
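The inter-rater agreement statistic the rebuttal proposes, Fleiss' kappa, is standard and easy to compute from an item-by-category count table. A minimal stdlib sketch (ours; the rating design itself is hypothetical):

```python
# Fleiss' kappa from a table where ratings[i][j] is the number of raters
# who assigned item i to category j (every row sums to the same rater
# count n). Our illustrative sketch of the statistic the rebuttal names.

def fleiss_kappa(ratings):
    N = len(ratings)          # items
    n = sum(ratings[0])       # raters per item
    k = len(ratings[0])       # categories
    # Mean observed per-item agreement P_bar
    P_bar = sum(
        (sum(c * c for c in row) - n) / (n * (n - 1)) for row in ratings
    ) / N
    # Chance agreement from marginal category proportions
    p = [sum(row[j] for row in ratings) / (N * n) for j in range(k)]
    P_e = sum(pj * pj for pj in p)
    return (P_bar - P_e) / (1 - P_e)
```

Perfect agreement yields kappa = 1; agreement at chance level yields 0, with negative values indicating worse-than-chance consistency among the human raters.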

Circularity Check

0 steps flagged

No significant circularity; benchmark evaluation is self-contained empirical measurement

full rationale

The paper defines CREATE as a new benchmark task requiring generation of concept-connecting paths scored explicitly on author-specified criteria of specificity (distinctiveness and closeness) and diversity (dissimilarity), with higher scores for larger sets of strong paths. Model results are obtained by applying these fixed rules to outputs from frontier LLMs; no equations reduce the final creative-utility ranking to a fitted parameter, prior self-citation, or input by construction. No self-citations, uniqueness theorems, or ansatzes are invoked to force the outcome. The reported finding that stronger models score higher is a direct, non-tautological computation on the open-ended task, consistent with standard new-benchmark evaluation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that the automated metrics for specificity and diversity align with human notions of creative utility and that the concept graph derived from parametric knowledge is sufficiently rich; no free parameters or invented entities are described in the abstract.

axioms (1)
  • domain assumption Automated scoring of path specificity and diversity captures meaningful creative utility
    Invoked when claiming models achieve higher creative utility; no human validation mentioned in abstract

pith-pipeline@v0.9.0 · 5496 in / 1222 out tokens · 31446 ms · 2026-05-15T13:02:36.438569+00:00 · methodology

discussion (0)


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. GIANTS: Generative Insight Anticipation from Scientific Literature

    cs.CL 2026-04 unverdicted novelty 8.0

    GIANTS-4B, trained with RL on a new 17k-example benchmark of parent-to-child paper insights, achieves 34% relative improvement over gemini-3-pro in LM-judge similarity and is rated higher-impact by a citation predictor.

Reference graph

Works this paper leans on

31 extracted references · 31 canonical work pages · cited by 1 Pith paper · 1 internal anchor
