Art or Artifice? Large Language Models and the False Promise of Creativity

Chien-Sheng Wu; Divyansh Agarwal; Philippe Laban; Smaranda Muresan; Tuhin Chakrabarty

arxiv: 2309.14556 · v3 · pith:DXDGZHIFnew · submitted 2023-09-25 · 💻 cs.CL · cs.AI · cs.HC

Tuhin Chakrabarty, Philippe Laban, Divyansh Agarwal, Smaranda Muresan, Chien-Sheng Wu

classification 💻 cs.CL cs.AI cs.HC

keywords ttcw creativity llms stories creative writing assessment language

Researchers have argued that large language models (LLMs) exhibit high-quality writing capabilities, from blogs to stories. However, objectively evaluating the creativity of a piece of writing is challenging. Inspired by the Torrance Test of Creative Thinking (TTCT), which measures creativity as a process, we use the Consensual Assessment Technique [3] and propose the Torrance Test of Creative Writing (TTCW) to evaluate creativity as a product. TTCW consists of 14 binary tests organized into the original dimensions of Fluency, Flexibility, Originality, and Elaboration. We recruit 10 creative writers and implement a human assessment of 48 stories, written either by professional authors or by LLMs, using TTCW. Our analysis shows that LLM-generated stories pass 3-10x fewer TTCW tests than stories written by professionals. In addition, we explore the use of LLMs as assessors to automate the TTCW evaluation, revealing that none of the LLMs positively correlate with the expert assessments.
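As a rough illustration of the comparison the abstract reports, the sketch below counts how many of the 14 binary TTCW tests each story passes and compares the group averages. The function names (`passed_count`, `mean_passed`) and every value in the example data are hypothetical and chosen only for illustration; they are not results from the paper, and the paper's actual aggregation across assessors may differ.

```python
from statistics import mean

NUM_TESTS = 14  # the abstract states that TTCW consists of 14 binary tests


def passed_count(results):
    """Return how many TTCW tests a single story passed.

    `results` is a hypothetical list of 14 booleans, one per test.
    """
    if len(results) != NUM_TESTS:
        raise ValueError(f"expected {NUM_TESTS} test results, got {len(results)}")
    return sum(bool(r) for r in results)


def mean_passed(stories):
    """Average number of passed tests over a group of stories."""
    return mean(passed_count(story) for story in stories)


# Made-up outcomes for illustration only (NOT data from the paper).
human_stories = [
    [True] * 12 + [False] * 2,
    [True] * 10 + [False] * 4,
]
llm_stories = [
    [True] * 2 + [False] * 12,
    [True] * 3 + [False] * 11,
]

human_avg = mean_passed(human_stories)
llm_avg = mean_passed(llm_stories)
ratio = human_avg / llm_avg  # how many times more tests the human group passes

print(f"human avg: {human_avg}, LLM avg: {llm_avg}, ratio: {ratio:.1f}x")
```

A ratio computed this way is what a statement such as "3-10x fewer tests" summarizes; a per-dimension breakdown (Fluency, Flexibility, Originality, Elaboration) would need a mapping from tests to dimensions, which the abstract does not give.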

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Before and After Temperature: A Distributional View of Creative LLM Generation
cs.CL 2026-05 unverdicted novelty 7.0

A per-token feature from temperature-induced changes in LLM token distributions predicts within-prompt creativity rank at Spearman rho 0.918 vs LLM judges and 0.870 vs humans, outperforming perplexity, entropy, top-1 ...

StoryAlign: Evaluating and Training Reward Models for Story Generation
cs.CL 2026-05 unverdicted novelty 7.0

StoryReward, trained on a new 100k story preference dataset, sets state-of-the-art performance on the introduced StoryRMB benchmark for aligning LLM stories with human preferences.

An LLM-Native Psychometric Instrument Does Not Predict LLM Behavior: Evidence Across 25 Models
cs.HC 2026-04 conditional novelty 7.0

An LLM-native five-factor psychometric instrument produces stable self-report structure but fails to predict observed behavior, and reveals a shared textual-surface bias between self-report and LLM judges that human r...