Do Large Language Models Always Tell The Same Stories?

Hans Ole Hatzel; Thennal DK

arxiv: 2606.17350 · v1 · pith:FSDWKYNUnew · submitted 2026-06-15 · 💻 cs.CL · cs.AI

Pith reviewed 2026-06-27 02:54 UTC · model grok-4.3

keywords: large language models, story generation, narrative diversity, similarity judgments, creative writing, output homogeneity, r/WritingPrompts

The pith

Large language models produce stories more similar to each other than human authors do.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether large language models generate diverse stories or converge on similar ones. It sets up a contrastive comparison using human-written stories and prompts from online writing communities, then gathers similarity ratings from people and from three automatic measures across ten different models. The results indicate that model outputs cluster more tightly than human ones, with the strongest models settling around a single generic narrative. Standard adjustments such as changing temperature or adding negative prompts do not restore the spread seen in human writing. A sympathetic reader would care because many applications assume models can supply varied creative text when prompted repeatedly.

Core claim

Using a contrastive framework and a dataset of human-written stories and prompts from r/WritingPrompts, the study collects narrative similarity judgments across 10 representative LLMs, utilizing both human evaluations and three different automatic annotation methods. The findings reveal a consistent trend: LLM-generated narratives are more similar to each other than human-written stories are. Frontier models in particular converge on a "mean" generic narrative that approximates individual human stories but lacks the collective diversity of human authors. Common mitigation strategies, including negative prompting and temperature scaling, fail to meaningfully address this homogeneity.

What carries the argument

Narrative similarity judgments collected from human raters and three automatic methods on stories generated from shared prompts.
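
The contrastive setup can be illustrated with a small sketch. The figure captions refer to triplets and a normalized selection rate, so the code below assumes each judgment pairs a reference story with two candidates and records which candidate the rater picked as narratively closer. The tuple layout, the source labels, and the function name `selection_rates` are hypothetical, and the paper's own normalization may differ from the simple ratio used here.

```python
from collections import defaultdict


def selection_rates(judgments):
    """Compute how often each candidate source is picked for each reference source.

    judgments: iterable of (ref_source, cand_a_source, cand_b_source, choice),
    where choice is "A" or "B" (hypothetical layout, not the paper's schema).
    Returns {(ref_source, cand_source): times chosen / times shown}.
    """
    shown = defaultdict(int)
    chosen = defaultdict(int)
    for ref, cand_a, cand_b, choice in judgments:
        if choice not in ("A", "B"):
            raise ValueError(f"choice must be 'A' or 'B', got {choice!r}")
        # Both candidates were shown once for this reference source.
        shown[(ref, cand_a)] += 1
        shown[(ref, cand_b)] += 1
        winner = cand_a if choice == "A" else cand_b
        chosen[(ref, winner)] += 1
    # Normalize by exposure so sources shown more often are not favored.
    return {pair: chosen[pair] / shown[pair] for pair in shown}
```

Under this sketch, a high rate for the pair ("llm", "llm") relative to ("llm", "human") would mean that, given a model-written reference, raters tend to pick another model-written story as the closer one.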

If this is right

LLM outputs cluster more tightly than human stories do across the tested models.
Frontier models converge toward a single generic narrative rather than matching human variety.
Negative prompting and temperature scaling leave the observed homogeneity largely unchanged.
The pattern holds when similarity is measured by both people and automatic methods.
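
For readers unfamiliar with temperature scaling, here is a minimal sketch of the standard technique: logits are divided by a temperature before the softmax, which flattens or sharpens the next-token distribution. The logits and function names are illustrative assumptions, not taken from the paper.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Return token probabilities after dividing logits by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; higher means a flatter distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)
```

Raising the temperature increases token-level entropy, but that is a different quantity from narrative diversity, which is consistent with the finding that temperature scaling leaves story-level homogeneity largely unchanged.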

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Repeated use of the same model for creative tasks may yield less variety than commissioning multiple human writers.
The convergence could stem from training objectives that reward responses close to an average of the training data.
Similar homogeneity might appear in other open-ended generation tasks such as dialogue or world-building.

Load-bearing premise

The chosen similarity judgments measure meaningful differences in story content and style rather than only surface features.

What would settle it

A model that produces story sets whose pairwise similarities are lower than those among human stories on the same prompts would contradict the central finding.
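
The test described above can be sketched as a comparison of mean pairwise similarity within each story set. The Jaccard word-overlap score is only a stand-in and is not one of the paper's similarity measures; any function that scores two stories could be passed in instead. All function names are hypothetical.

```python
from itertools import combinations


def jaccard(story_a, story_b):
    """Toy similarity: word-set overlap (a stand-in, not a narrative measure)."""
    tokens_a = set(story_a.lower().split())
    tokens_b = set(story_b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def mean_pairwise_similarity(stories, similarity=jaccard):
    """Average similarity over all unordered pairs of stories in one set."""
    pairs = list(combinations(stories, 2))
    if not pairs:
        raise ValueError("need at least two stories")
    return sum(similarity(a, b) for a, b in pairs) / len(pairs)


def contradicts_finding(model_stories, human_stories, similarity=jaccard):
    """True if the model set is less self-similar than the human set."""
    model_score = mean_pairwise_similarity(model_stories, similarity)
    human_score = mean_pairwise_similarity(human_stories, similarity)
    return model_score < human_score
```

Both sets should be written for the same prompts, as the criterion requires; a surface-level score like the one used here would also be open to the load-bearing-premise concern raised earlier.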

Figures

Figures reproduced from arXiv: 2606.17350 by Hans Ole Hatzel, Thennal DK.

**Figure 2.** Similarity heatmap showcasing the normalized selection rate for all models annotated via narrative

**Figure 3.** Similarity heatmap of triplets annotated by the LLM judge and with a confidence filter of 5 within the

**Figure 4.** Similarity heatmap of triplets annotated by the preference model within the closed-source, open-source,

**Figure 5.** Similarity heatmap of triplets annotated by

**Figure 6.** Average LLM judge agreement with human annotators split by reported confidence. Neither the LLM judge nor the human annotators ever reported a confidence score of 1. As both human annotators and the LLM judge provide confidence scores, we compile a heatmap showcasing average agreement across confidence scores in

**Figure 7.** Similarity heatmap of triplets annotated by the LLM judge, showcasing the normalized selection rate

**Figure 8.** Similarity heatmap of triplets annotated by the preference model, within the closed source, open source,

**Figure 9.** Similarity heatmap of triplets annotated by the LLM judge, showcasing the normalized selection rate

**Figure 10.** Similarity heatmap of triplets annotated by

**Figure 11.** Similarity heatmap of triplets annotated by the preference model, within the OLMo pool under a

**Figure 12.** The annotation guidelines provided to human annotators (part 1 of 2).

**Figure 13.** The annotation guidelines provided to human annotators (part 2 of 2).

**Figure 14.** Prompt for LLM-as-a-Judge triplet annotation.

**Figure 15.** Prompt for Narrative Component Extraction.

**Figure 16.** All LLM prompts used for generating stories.

Original abstract

Recent advances in large language models (LLMs) have enabled the generation of high-quality prose, yet the question of whether these models are capable of generating diverse outputs remains contested. In this work, we investigate the diversity of LLM-generated stories through the framework of narrative similarity. Using a contrastive framework and a dataset of human-written stories and prompts from r/WritingPrompts, we collect narrative similarity judgments across 10 representative LLMs, utilizing both human evaluations and three different automatic annotation methods. Our findings reveal a consistent trend: LLM-generated narratives are consistently more similar to each other than human-written stories are. We demonstrate that frontier models in particular converge on a "mean" generic narrative that approximates individual human stories but lacks the collective diversity of human authors. Finally, we show that common mitigation strategies, including negative prompting and temperature scaling, fail to meaningfully address this homogeneity.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

LLMs produce more similar stories than humans on this r/WritingPrompts setup, but the similarity measures lack enough validation to confirm they track narrative content rather than style or data overlap.

The main result is that stories from 10 LLMs are rated more similar to each other than human stories from the same prompts, with frontier models especially collapsing toward a generic mean. Human judgments plus three automatic methods support the trend, and standard fixes like temperature scaling or negative prompting do not change it much.

What the paper does is apply a contrastive similarity framework to story collections and directly test mitigation attempts. Using both human and automatic annotations across multiple models is a reasonable way to measure the effect, and the r/WritingPrompts source gives a consistent prompt base.

The soft spots are in the missing specifics. No numbers on dataset size, prompt sampling, or inter-annotator agreement appear in the abstract, and there is no reported check on whether the automatic methods actually align with humans on narrative differences. The stress-test point lands here: if the automatic annotators use embeddings or LLM judges trained on overlapping data, they could rate LLM outputs as similar even when plots or characters differ. The human judgments are the critical anchor, so the paper needs to show they drive the result and that similarity goes beyond surface features.

This is for people studying output diversity in creative generation tasks. It engages the existing homogeneity literature without obvious fitting or circular claims. The evidence is suggestive but thin on the methods side.

I would bring it to a reading group to discuss the metrics. I would not cite it yet. It deserves peer review so the authors can add the validation details and address the bias risk in the automatic scores.

Referee Report

2 major / 0 minor

Summary. The paper claims that LLM-generated narratives are consistently more similar to each other than human-written stories, based on a contrastive framework using r/WritingPrompts data. It collects narrative similarity judgments from human evaluators and three automatic methods across 10 LLMs, concluding that frontier models converge on a 'mean' generic narrative approximating individual human stories but lacking collective human diversity, and that mitigation strategies like negative prompting and temperature scaling do not resolve the homogeneity.

Significance. If the central empirical claim holds after rigorous validation, the result would highlight a meaningful limitation in current LLMs for creative generation tasks, indicating reduced narrative diversity relative to human authors. The use of both human judgments and multiple automatic methods, along with the contrastive setup on real prompt data, provides a reasonable empirical foundation, though the absence of reported validation details limits immediate assessment of robustness.

major comments (2)

[Abstract] Abstract/Methods: No details are given on dataset size, prompt construction from r/WritingPrompts, inter-annotator agreement for human evaluations, or how the three automatic annotation methods were validated against human judgments. These elements are load-bearing for the central claim that similarity measures demonstrate LLM homogeneity, as the reader's assessment notes soundness at 3.0 due to this omission.
[Methods] Methods/Results: The contrastive framework does not isolate the risk that automatic methods (e.g., embeddings or LLM judges) may rate LLM outputs as more similar due to training data overlap rather than true differences in plot, character arcs, or event sequences. This directly affects the validity of the 'mean generic narrative' conclusion, as noted in the stress-test concern.
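
The validation the referee asks for could be reported with standard agreement statistics. The sketch below computes Cohen's kappa between two annotators and the raw agreement between human and judge labels split by confidence score. The "A"/"B" label format and both function names are assumptions for illustration; this is not the authors' evaluation code.

```python
from collections import Counter, defaultdict


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators over the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    observed = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Expected agreement if both annotators labeled independently at random.
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


def agreement_by_confidence(human, judge, confidence):
    """Fraction of items where judge matches human, grouped by confidence."""
    totals = defaultdict(int)
    matches = defaultdict(int)
    for h, j, c in zip(human, judge, confidence):
        totals[c] += 1
        matches[c] += int(h == j)
    return {c: matches[c] / totals[c] for c in totals}
```

Kappa is useful here because a binary A-or-B choice already yields about 50% raw agreement by chance when labels are balanced.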

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments. We address each major comment below and plan to make revisions accordingly.

Referee: [Abstract] Abstract/Methods: No details are given on dataset size, prompt construction from r/WritingPrompts, inter-annotator agreement for human evaluations, or how the three automatic annotation methods were validated against human judgments. These elements are load-bearing for the central claim that similarity measures demonstrate LLM homogeneity, as the reader's assessment notes soundness at 3.0 due to this omission.

Authors: We acknowledge the omission of these details in the current manuscript. Upon revision, we will include comprehensive information on the dataset size and prompt construction process from r/WritingPrompts. We will also report inter-annotator agreement for the human evaluations and provide validation results for the automatic methods against human judgments to substantiate the reliability of our similarity measures.
revision: yes

Referee: [Methods] Methods/Results: The contrastive framework does not isolate the risk that automatic methods (e.g., embeddings or LLM judges) may rate LLM outputs as more similar due to training data overlap rather than true differences in plot, character arcs, or event sequences. This directly affects the validity of the 'mean generic narrative' conclusion, as noted in the stress-test concern.

Authors: We appreciate this point about potential confounds in automatic similarity measures. Our study relies on human judgments as the gold standard, which are unaffected by training data overlap. The automatic methods were validated to align with human judgments, and the homogeneity trend holds across both. In the revised paper, we will add a discussion of this limitation and its implications for interpreting the automatic results, while emphasizing that the core findings are robust due to the human evaluation component.
revision: partial

Circularity Check

0 steps flagged

No circularity: empirical observations from collected judgments

The paper reports an empirical study collecting narrative similarity judgments (human plus three automatic methods) on LLM vs. human stories from r/WritingPrompts. No derivation chain, first-principles result, or prediction is claimed; the central finding is a direct statistical observation from the gathered data. No self-definitional relations, fitted inputs presented as predictions, or load-bearing self-citations appear in the abstract or described framework. The work is self-contained as an empirical measurement exercise.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no information on free parameters, axioms, or invented entities used in the study.
