LLMs Generate Kitsch

Alexander Koller; Stefan Ortlieb; Xenia Klinge

arxiv: 2604.25929 · v1 · submitted 2026-04-01 · 💻 cs.CL

Pith reviewed 2026-05-13 22:21 UTC · model grok-4.3

classification 💻 cs.CL

keywords: large language models, kitsch, text generation, creativity, empirical evaluation, training objectives, perception study

The pith

Large language models systematically generate kitsch because of how they are trained on next-token prediction.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tackles the contradiction that LLM outputs often receive high ratings in controlled tests while still feeling generic and emotionally shallow. It resolves the tension by proposing that the models produce kitsch—conventional, sentimental, and overly polished work—as a direct result of their training process. Empirical tests confirm that readers judge LLM-generated stories as kitschier than human ones once individual definitions of kitsch are held constant. The argument carries implications for how creativity is measured in future experiments and for practical uses such as research writing or code generation.

Core claim

LLMs generate kitsch as a systematic outcome of their training, which favors statistically frequent and conventional patterns over original or risky ones, producing artifacts that readers experience as hollow despite surface-level appeal.

What carries the argument

Next-token prediction training objective, which rewards outputs matching common training-data patterns and thereby favors the safe, sentimental, and conventional qualities that define kitsch.
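
The objective named here can be written in its standard textbook form. The notation below (training corpus $\mathcal{D}$, parameters $\theta$, tokens $x_t$, sequence length $T$) is assumed for illustration and is not taken from the paper.

```latex
\mathcal{L}(\theta) \;=\; -\,\mathbb{E}_{x \sim \mathcal{D}}
  \left[ \sum_{t=1}^{T} \log p_\theta\!\left(x_t \mid x_{<t}\right) \right]
```

For a fixed context, this expected loss is smallest when $p_\theta(\cdot \mid x_{<t})$ matches the distribution of continuations in the data, so continuations that are frequent in the corpus receive high probability. Whether that tendency alone accounts for kitsch is the paper's argument, not something the formula establishes by itself.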

If this is right

Readers perceive LLM stories as kitschier once their personal definition of kitsch is controlled for in the rating task.
Evaluations of LLM creativity must treat kitsch as a separate dimension from overall quality or human-likeness.
Applications of LLMs to open-ended creative work such as research or coding inherit the same bias toward conventional outputs.
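
One possible reading of "controlling for the definition" is to compare ratings only among participants who share a definition of kitsch. The sketch below illustrates that idea under stated assumptions: the paper's actual procedure is not described on this page, and the definition groups, the 1-7 scale, and every number are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy records: (definition_group, story_source, kitsch_rating).
# Groups, scale, and values are made up for illustration only.
ratings = [
    ("sentimental", "llm", 6), ("sentimental", "llm", 5),
    ("sentimental", "human", 3), ("sentimental", "human", 4),
    ("cheap_decor", "llm", 4), ("cheap_decor", "llm", 4),
    ("cheap_decor", "human", 3), ("cheap_decor", "human", 5),
]

def gap_by_definition(records):
    """Return {definition_group: mean LLM rating - mean human rating}."""
    buckets = defaultdict(lambda: {"llm": [], "human": []})
    for group, source, score in records:
        buckets[group][source].append(score)
    return {
        group: mean(by_source["llm"]) - mean(by_source["human"])
        for group, by_source in buckets.items()
    }

gaps = gap_by_definition(ratings)
for group, gap in sorted(gaps.items()):
    print(f"{group}: LLM minus human = {gap:+.2f}")
```

Reporting the gap per definition group, rather than pooling all ratings, keeps a difference in what raters mean by "kitsch" from being mistaken for a difference between LLM and human stories.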

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Training changes that penalize predictability could reduce the kitsch output without harming fluency.
The same mechanism likely applies to image, music, or video generators trained on similar objectives.
Success metrics for creative AI may need to include explicit checks for conventionality rather than relying on human preference ratings alone.

Load-bearing premise

Kitsch can be isolated and measured as a distinct property independent of other quality judgments, and training is the primary cause rather than prompt design or model scale.

What would settle it

An experiment in which readers rate LLM stories as no kitschier than human stories when definitions are controlled, or a model trained under a different objective that eliminates the elevated kitsch ratings.

Figures

Figures reproduced from arXiv: 2604.25929 by Alexander Koller, Stefan Ortlieb, Xenia Klinge.

**Figure 2.** The two-step story generation method from human original (left) to LLM-generated variant (bottom…

**Figure 3.** Full version of the LLM-generated story in…

Original abstract

Large Language Models (LLMs) are increasingly used to generate pictures, texts, music, videos, and other works that have traditionally required human creativity. LLM-generated artifacts are often rated better than human-generated works in controlled studies. At the same time, they can come across as generic and hollow. We propose to resolve this tension by arguing that LLMs systematically generate kitsch, and that this is a consequence of the way in which they are trained. We also show empirically that readers perceive LLM-generated stories as kitschier, if we control for their definition of "kitsch". We discuss implications for the design of future studies and for creative tasks such as research and coding.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper claims LLMs produce kitsch due to training and backs it with a reader study, but the causal claim lacks controls and the methods details are thin.

The main point is that LLMs generate stories readers rate as kitschier than human ones when the definition of kitsch is held constant, and the authors link this to the training process itself. This framing tries to explain why generated text often scores well on average but still feels generic or shallow. The reader study is the concrete piece of evidence they offer for the perception side of the argument.

It is new in applying the kitsch concept from aesthetics to LLM output in a controlled way, and the study design at least attempts to standardize what participants mean by the term. That is a step beyond informal observations about bland generated text. The paper does a reasonable job laying out implications for how creative tasks and evaluations might need to change if this pattern holds. It points toward adjustments in training or in how we measure success on storytelling and similar tasks.

The soft spots sit in the causal part and the reporting. The training objective is treated as the primary driver without tests that vary scale, prompts, or alignment while holding content fixed, so other factors could explain the kitsch ratings just as easily. The abstract also omits sample size, story selection rules, and any reliability or statistical checks, which makes it hard to gauge how solid the result actually is. These gaps are real but not fatal to the idea.

The work is aimed at people in computational creativity and LLM evaluation who want a different lens on output quality. A reader already thinking about why high-rated generations lack depth would find the framing useful even if they question the training link. It deserves peer review because the core observation is worth testing properly and the experiment idea can be refined with more controls and full methods reporting.

Referee Report

2 major / 2 minor

Summary. The paper claims that LLMs systematically generate kitsch as a direct consequence of their training objectives, which accounts for the observed tension between high controlled-study ratings and the generic, hollow quality of outputs. It supports this via an empirical reader study showing that participants rate LLM-generated stories as kitschier than human ones when their individual definitions of kitsch are controlled for, and discusses implications for evaluation protocols in creative tasks such as research and coding.

Significance. If the central claim is substantiated with adequate controls, the work supplies a coherent explanatory framework for limitations in LLM creativity that goes beyond surface-level quality metrics. It could usefully inform the design of future human-AI comparison studies and prompt the development of training or evaluation methods that better distinguish depth from superficial appeal.

major comments (2)

[Empirical results section] The reader-perception experiment supplies no information on sample size, story selection criteria, statistical tests, or inter-rater reliability. These omissions prevent assessment of whether the reported difference in kitsch perception is robust or generalizable, directly undermining the empirical support for the central claim.
[Training mechanism discussion] Training-to-kitsch argument (likely §3): the manuscript asserts rather than derives that the pretraining objective is the primary driver of kitsch; no ablations are reported that vary scale, prompt design, or post-training alignment while holding story content fixed. Without such controls, the causal attribution remains untested and the claim is not load-bearing.
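
The statistical tests requested in the first major comment could take many forms. As one hedged sketch, the block below runs an exact sign-flip permutation test on paired differences, for example each participant's mean kitsch rating for LLM stories minus their mean for human stories. The function name and the data are hypothetical, and this is not the analysis used in the paper.

```python
from itertools import product

def sign_flip_p_value(diffs):
    """Exact one-sided paired permutation test.

    Under the null hypothesis that the sign of each paired difference is
    arbitrary, enumerate all 2**n sign assignments and return the share
    whose summed difference is at least as large as the observed sum.
    """
    observed = sum(diffs)
    count = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        if sum(s * d for s, d in zip(signs, diffs)) >= observed:
            count += 1
    return count / total

# Hypothetical per-participant differences (LLM minus human), made up here.
diffs = [2, 1, 3, 1, 2, 1]
p = sign_flip_p_value(diffs)
print(f"one-sided p = {p:.4f}")
```

Full enumeration is only practical for small samples because the number of sign assignments doubles with each participant; a larger study would sample random sign flips or use a standard paired test instead.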

minor comments (2)

[Abstract] Abstract: the clause 'if we control for their definition of kitsch' should specify the exact experimental procedure used to implement that control.
[Introduction] Notation and terminology: ensure consistent use of 'kitsch' across the manuscript and provide a brief operational definition early in the text to aid readers unfamiliar with the aesthetic concept.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which identify key areas where the manuscript can be clarified and strengthened. We respond to each major comment below and indicate the revisions we will make.

Referee: [Empirical results section] The reader-perception experiment supplies no information on sample size, story selection criteria, statistical tests, or inter-rater reliability. These omissions prevent assessment of whether the reported difference in kitsch perception is robust or generalizable, directly undermining the empirical support for the central claim.

Authors: We agree that these methodological details are essential for assessing robustness. In the revised manuscript we will expand the empirical results section to report the sample size, the criteria used for selecting and matching stories, the statistical tests applied, and inter-rater reliability measures. These additions will allow readers to evaluate the strength and generalizability of the observed difference in kitsch perception.

Revision: yes

Referee: [Training mechanism discussion] Training-to-kitsch argument (likely §3): the manuscript asserts rather than derives that the pretraining objective is the primary driver of kitsch; no ablations are reported that vary scale, prompt design, or post-training alignment while holding story content fixed. Without such controls, the causal attribution remains untested and the claim is not load-bearing.

Authors: We will revise §3 to derive the link more explicitly: the autoregressive pretraining objective minimizes cross-entropy loss and thereby favors high-probability continuations drawn from the training distribution, which statistically favors conventional, formulaic patterns that readers perceive as kitsch. While the current work does not include ablations that vary scale, prompt design, or alignment with content held fixed, the core mechanism follows directly from the properties of next-token prediction itself rather than from any particular model variant. We will add a limitations paragraph noting the absence of such controls and suggesting them as future work.

Revision: partial
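
The mechanism described in this response can be illustrated with a toy model. The sketch below swaps in a count-based bigram model for a neural language model, an assumption made purely for brevity: for such a model the cross-entropy-minimizing estimate is the relative frequency of each continuation, and greedy decoding returns the most frequent one. The corpus and all names are made up.

```python
from collections import Counter, defaultdict

# Made-up toy corpus; real pretraining data and models are vastly larger.
corpus = [
    "her heart was full of love",
    "her heart was full of hope",
    "her heart was full of love",
    "her heart was full of gravel",
]

def fit_bigram(sentences):
    """Maximum-likelihood bigram model: P(next | prev) = count / total."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return {
        prev: {tok: c / sum(ctr.values()) for tok, c in ctr.items()}
        for prev, ctr in counts.items()
    }

model = fit_bigram(corpus)
next_dist = model["of"]          # distribution over words following "of"
greedy = max(next_dist, key=next_dist.get)
print(next_dist)
print("greedy continuation:", greedy)
```

The unusual continuation still receives some probability, but greedy or low-temperature decoding would rarely or never select it. The toy says nothing about scale, prompt design, or post-training alignment, which are exactly the controls the referee asks for.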

Circularity Check

0 steps flagged

No significant circularity; central claim is an interpretive assertion supported by independent empirical evidence

full rationale

The paper asserts that LLMs generate kitsch as a consequence of training but provides no equations, fitted parameters, or self-citations that reduce the claim to its own inputs by construction. The reader-perception experiment is described as controlling for participants' definition of kitsch and is presented as separate evidence rather than a statistical renaming of a fit. No uniqueness theorems, ansatzes smuggled via citation, or self-definitional loops appear in the abstract or described structure. The derivation chain is therefore self-contained as an argument plus observation, with no load-bearing step that collapses to a tautology or prior self-result.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that kitsch is a stable, controllable perceptual category and that next-token training directly produces it; no free parameters or invented entities are introduced in the abstract.

axioms (1)

domain assumption: Kitsch is a measurable perceptual quality that can be isolated by controlling for individual definitions in rating tasks.
The empirical component depends on this assumption to interpret the reader study results.
