
arxiv: 2606.11762 · v1 · pith:KDRQ23PJnew · submitted 2026-06-10 · 💻 cs.CL · cs.AI

Automated Creativity Evaluation of Language Models Across Open-Ended Tasks

Pith reviewed 2026-06-27 09:52 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords LLM creativity evaluation · semantic entropy · task-agnostic assessment · divergent creativity · convergent creativity · multi-agent judge · automated evaluation · open-ended tasks

The pith

An automated framework evaluates LLM creativity across any open-ended task by separating measurement from the task itself.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a general method to assess creativity in language models that applies equally to problem-solving, research ideation, and creative writing. It measures divergent aspects like novelty and diversity with semantic entropy, a reference-free approach, and convergent task fulfilment with a retrieval-based multi-agent judge. This separation of the evaluation tools from the specific task enables the same apparatus to work across domains without custom rules for each one. A reader would care because prior creativity metrics embed assumptions tied to one domain, which blocks scalable comparison and progress tracking. The framework also reports how model size, temperature, recency, and reasoning shape creative outputs.

Core claim

The central claim is that creativity in LLMs can be quantified in a task-agnostic way by combining semantic entropy for novelty and diversity with a retrieval-based multi-agent judge for context-sensitive task fulfilment, and that this combination reliably tracks these facets while exposing effects of model properties across problem-solving, ideation, and writing tasks.
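
To make the divergent half of the claim concrete, here is a minimal sketch of an entropy-over-clusters computation in the spirit of Figure 1 (steps clustered by similarity, entropy computed over the clusters). Everything specific is an assumption for illustration: the greedy clustering rule, the 0.8 cosine threshold, and the function names `greedy_clusters` and `semantic_entropy` are not taken from the paper, whose actual clustering procedure is not visible on this page.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two vectors; returns 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def greedy_clusters(
    embeddings: Sequence[Sequence[float]], threshold: float = 0.8
) -> List[List[int]]:
    """Assign each sample to the first cluster whose seed is similar enough.

    ASSUMPTION: greedy seed-based clustering with a fixed threshold is a
    stand-in; the paper may cluster differently.
    """
    clusters: List[List[int]] = []
    for i, emb in enumerate(embeddings):
        for cluster in clusters:
            if cosine(emb, embeddings[cluster[0]]) >= threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])  # no similar cluster found: start a new one
    return clusters


def semantic_entropy(clusters: Sequence[Sequence[int]]) -> float:
    """Shannon entropy (in nats) of the empirical distribution over clusters."""
    total = sum(len(c) for c in clusters)
    if total == 0:
        return 0.0
    return -sum((len(c) / total) * math.log(len(c) / total) for c in clusters)
```

Under this sketch, samples that all land in one cluster give an entropy of zero, and samples spread evenly over m clusters give log(m), so higher values indicate more semantically distinct outputs.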

What carries the argument

The domain-agnostic framework that decouples the measurement apparatus from the creative task, using semantic entropy to quantify divergent creativity and a retrieval-based multi-agent judge to assess convergent task fulfilment.
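
The retrieval step of the judge, as the Figure 2 caption describes it (agents record "fragments", fragments are embedded in a vector database, and each agent retrieves the k most relevant ones by cosine similarity), can be sketched as below. This is a hedged illustration only: the `Fragment` and `FragmentStore` names, the in-memory list standing in for a vector database, and the plain-list embeddings are all assumptions, not the authors' implementation.

```python
import math
from dataclasses import dataclass, field
from typing import List, Sequence


def _cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


@dataclass
class Fragment:
    """One recorded insight (hypothetical structure)."""
    agent: str              # e.g. "problem", "solution", "criterion"
    text: str
    embedding: List[float]  # assumed to come from some external embedding model


@dataclass
class FragmentStore:
    """In-memory stand-in for the vector database (assumption)."""
    fragments: List[Fragment] = field(default_factory=list)

    def add(self, fragment: Fragment) -> None:
        self.fragments.append(fragment)

    def retrieve(self, query_embedding: Sequence[float], k: int) -> List[Fragment]:
        """Return the k fragments most similar to the query, best first."""
        if k < 0:
            raise ValueError("k must be non-negative")
        ranked = sorted(
            self.fragments,
            key=lambda f: _cosine(query_embedding, f.embedding),
            reverse=True,
        )
        return ranked[:k]
```

The intuition behind the reported token savings is that each agent's prompt carries only the k retrieved fragments rather than the full discussion history; how the paper measures that saving is stated only in the Figure 2 caption.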

If this is right

  • The same metrics apply to new open-ended tasks without redesign.
  • Model size, temperature, recency, and reasoning produce measurable differences in creative performance.
  • The multi-agent judge is over 60 percent more efficient than prior methods; the Figure 2 caption reports the saving as a roughly 63% cut in token usage relative to ChatEval.
  • Semantic entropy aligns with human annotations and existing diversity baselines.
  • The approach supports reproducible benchmarking of creative capabilities across models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The separation of measurement from task could extend the same metrics to evaluate creativity in code generation or image captioning without new tuning.
  • Large-scale comparisons of creative output across dozens of models become feasible if the metrics remain stable.
  • Training objectives might target the specific facets tracked by entropy and the judge to improve selected aspects of creativity.

Load-bearing premise

Semantic entropy supplies a robust reference-free metric for novelty and diversity that matches human judgments, and the retrieval-based multi-agent judge accurately evaluates task fulfilment.

What would settle it

Human raters show low correlation with semantic entropy scores on novelty for a new collection of model outputs, or expert assessments of task fulfilment diverge from the multi-agent judge scores in an additional domain.
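
The first test above amounts to a rank-correlation check between semantic entropy scores and human novelty ratings. A minimal sketch follows, assuming Spearman's rank correlation as the statistic (the page does not say which statistic the paper uses); the function names are illustrative.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for t in range(i, j + 1):
            ranks[order[t]] = avg
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; raises if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0.0 or sd_y == 0.0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sd_x * sd_y)


def spearman(entropy_scores: Sequence[float], human_ratings: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    if len(entropy_scores) != len(human_ratings) or len(entropy_scores) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    return pearson(average_ranks(entropy_scores), average_ranks(human_ratings))
```

A value near zero on a fresh set of outputs would count against the premise; what threshold qualifies as "low" is a judgment the page leaves open.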

Figures

Figures reproduced from arXiv: 2606.11762 by Alvin Chan, Min Sen Tan, Mohor Banerjee, Nadya Yuki Wangsajaya, Swaagat Bikash Saikia, Syed Ali Redha Alsagoff, Zachary Kit Chun Choy.

Figure 1: Divergent Creativity. LLM-generated steps clustered by similarity, with entropy computed over cluster
Figure 2: Retrieval-Based Multi-Agent Framework. Left: Three specialised LLM agents (Problem, Solution, and Criterion) analyze tasks from different perspectives, recording insights in "fragments". Middle: Fragments are embedded in a vector database; each agent retrieves the k most relevant fragments via cosine similarity at their turn. Right: This retrieval loop cuts token usage by ≈63% compared to ChatEval while conv…
Figure 3: Overview of our benchmark, for all 3 datasets.
Figure 4: Semantic Entropy analysis. (a): Step-level SE against no. of semantic clusters formed at each step. (b): Average step-level SE under different sampling temperatures. (c): Step-level SE vs. average pairwise cosine similarity among sampled candidates. (d): Solution-level SE versus LLM-judged novelty rankings of final solutions. (T) and (NT) denote Thinking and Non-Thinking model variants, respectively.
Figure 5: The impact of various parameters (left: model size, center: model recency, right: reasoning capabilities)
Figure 6: Semantic Entropy compared to different convergent creativity metrics (Y-axis) on the MacGyver dataset.
Figure 7: The correlation between LLMJudge novelty
Figure 8: Performance of our discussion framework at different confidence thresholds for early exit.
Figure 9: Using the flip adds a constraint to the LLM and rigorously tests its divergent creativity - it becomes more
Figure 10: Semantic Entropy compared to different convergent creativity metrics (Y-axis) from the HypoGen dataset.
Figure 11: Semantic Entropy compared to different convergent creativity metrics (Y-axis) from the BookMIA dataset
Figure 12: Effect of model recency on semantic entropy.
Figure 13: Effect of model size on semantic entropy.
Figure 14: The effect of temperature on convergent creativity.
Figure 15: Distribution of steps w.r.t. number of semantic classes generated while sampling that step.
Figure 16: Average semantic entropy for different steps of solutions for different LLMs.
Original abstract

Large language models (LLMs) have achieved remarkable progress in language understanding, reasoning, and generation, sparking growing interest in their creative potential. Realizing this potential requires systematic and scalable methods for evaluating creativity across diverse tasks. However, most existing creativity metrics are tightly coupled to specific tasks, embedding domain assumptions into the evaluation process, and limiting scalability and generality. To address this gap, we introduce an automated, domain-agnostic framework for quantifying LLM creativity across open-ended tasks. Our approach separates the measurement apparatus from the creative task itself, enabling scalable, task-agnostic assessment. Divergent creativity is measured using semantic entropy, a reference-free and robust metric for novelty and diversity, validated against human annotations, LLM-based novelty judgments and baseline diversity measures. Convergent creativity is assessed via a novel retrieval-based multi-agent judge framework that delivers context-sensitive evaluation of task fulfilment with over 60% improved efficiency. We validate our framework in three qualitatively distinct domains: problem-solving (MacGyver), research ideation (HypoGen), and creative writing (BookMIA), using a broad suite of LLMs. Empirical results show that our framework reliably captures key facets of creativity, including novelty, diversity, and task fulfilment, and reveal how model properties, such as size, temperature, recency, and reasoning, impact creative performance. Our work establishes a reproducible and generalizable standard for automated LLM creativity evaluation, paving the way for scalable benchmarking and accelerating progress in creative AI.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The manuscript introduces an automated, domain-agnostic framework for quantifying LLM creativity across open-ended tasks. Divergent creativity is measured via semantic entropy as a reference-free metric for novelty and diversity, validated against human annotations, LLM novelty judgments, and baseline diversity measures. Convergent creativity is assessed with a retrieval-based multi-agent judge for context-sensitive task fulfilment, claiming over 60% efficiency gains. The framework is validated across three domains (MacGyver for problem-solving, HypoGen for research ideation, BookMIA for creative writing) using multiple LLMs, with results on how model size, temperature, recency, and reasoning affect performance. The work positions itself as establishing a reproducible, generalizable standard for automated creativity evaluation.

Significance. If the described validations hold, the work is significant for advancing scalable creativity assessment in LLMs by decoupling the measurement apparatus from specific tasks, a clear strength that addresses limitations of prior task-coupled metrics. The multi-domain empirical results and analysis of model properties add value for understanding creative performance. The emphasis on reference-free metrics and reproducibility through validation protocols is a positive contribution to the field.

minor comments (2)
  1. [Abstract] The claim of 'over 60% improved efficiency' for the multi-agent judge requires an explicit statement of the baseline comparator and measurement protocol (e.g., wall-clock time or token count) to allow readers to assess the figure.
  2. [Abstract] The three domains are described as 'qualitatively distinct' without a short justification of the dimensions of distinctness (e.g., convergent vs. divergent demands or output modalities); adding one sentence would strengthen the generality argument.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive evaluation of our manuscript and the recommendation for minor revision. We are encouraged by the recognition that the domain-agnostic framework, reference-free metrics, and multi-domain validations represent a meaningful contribution to scalable LLM creativity assessment.

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The abstract and provided text describe a framework using semantic entropy for divergent creativity (validated externally against human annotations, LLM judgments, and baseline diversity measures) and a retrieval-based multi-agent judge for convergent creativity. No equations, fitted parameters renamed as predictions, or self-citation chains are shown that reduce any claimed result to its inputs by construction. The separation of measurement from task and empirical validations across domains (MacGyver, HypoGen, BookMIA) stand as independent content without load-bearing self-referential steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Only the abstract is available; no specific free parameters, detailed axioms, or invented entities can be identified beyond the high-level separation of creativity types.

axioms (1)
  • Domain assumption: creativity can be meaningfully separated into divergent (novelty/diversity) and convergent (task fulfilment) components for automated evaluation.
    The entire framework is constructed around this separation as described in the abstract.

pith-pipeline@v0.9.1-grok · 5824 in / 1386 out tokens · 26028 ms · 2026-06-27T09:52:19.311875+00:00 · methodology

