Elias in the Lighthouse, Again? Diagnosing Low Diversity in LLM Stories

David Mimno; Sil Hamilton

arxiv: 2605.26492 · v1 · pith:4LVLFVLGnew · submitted 2026-05-26 · 💻 cs.CL · cs.AI · cs.LG

Pith reviewed 2026-06-29 18:50 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG

keywords LLM story generation · output diversity · preference data · model alignment · repetitive tropes · lighthouse stories · post-training effects


The pith

Eleven words appear in 88.3 percent of sampled LLM stories, a pattern the paper attributes to small preference datasets.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Researchers generated 20,000 stories from four current models using five prompts and found that eleven specific words appear in 88.3 percent of the outputs with little variation across models. The repeated words consist of names such as Elias, Mara, and Elara, the lighthouse setting, and professions including clockmaker and librarian. These tokens occur rarely in published literature and pre-training data but are found in preference data used for alignment. Even so, lighthouse-style stories are infrequent compared with the average post-training story, much of which references copyrighted characters or adult content. The results indicate that small preference datasets can exert outsized influence when paired with alignment methods.

Core claim

Sampling 20,000 stories from four current models using five prompts reveals that 11 words occur in 88.3% of generated stories, with little difference between models. These words include names (Elias, Mara, Elara), settings (lighthouses), and professions (clockmaker, librarian). These tokens do not often occur in published literature nor pre-training data, but they are found in preference data that is likely to have been used by all current models. Surprisingly, these "lighthouse" stories are infrequent when compared with the average post-training story, much of which contains references to copyrighted characters or adult content. This result demonstrates the potentially disproportionate impact of small datasets combined with powerful alignment algorithms.
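The headline figure can be read as a coverage statistic: the share of stories that contain at least one word from a fixed list. The sketch below is one way to compute such a number, not the authors' procedure. It assumes the "at least one of the words" reading, uses only the six words named in the text (the remaining five are not listed here), and runs on made-up example stories. The helper names `contains_target` and `coverage` are hypothetical.

```python
import re

# Only six of the eleven words are named in the text, so this set is
# deliberately incomplete. Treat it as a placeholder.
TARGET_WORDS = {"elias", "mara", "elara", "lighthouse", "clockmaker", "librarian"}

TOKEN_RE = re.compile(r"[a-z]+")


def contains_target(story, targets=TARGET_WORDS):
    """Return True if any token, or the token minus a trailing 's', is a target."""
    for token in TOKEN_RE.findall(story.lower()):
        if token in targets:
            return True
        if token.endswith("s") and token[:-1] in targets:
            return True
    return False


def coverage(stories, targets=TARGET_WORDS):
    """Fraction of stories that contain at least one target word."""
    if not stories:
        return 0.0
    hits = sum(contains_target(story, targets) for story in stories)
    return hits / len(stories)


# Toy stories invented for illustration, not drawn from the paper's sample.
stories = [
    "Elias climbed the lighthouse stairs.",
    "The old clockmaker wound every clock in town.",
    "A fox crossed the frozen river at dawn.",
    "Two librarians argued about the catalogue.",
]
print(coverage(stories))
```

Matching on lowercase word tokens keeps the count simple, but it will miss inflected or misspelled forms other than a plain plural, so a real replication would need to state its matching rule explicitly.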

What carries the argument

The mechanism by which small preference datasets introduce repeated tokens into post-trained models, overriding patterns from pre-training data.

If this is right

Models from different providers converge on the same repeated tokens because they draw from overlapping preference data.
Alignment algorithms amplify the frequency of items that are rare in pre-training but present in small preference sets.
Post-training outputs exhibit lower diversity than would be predicted from pre-training data alone.
Within post-training data, lighthouse-style stories are less common than stories that reference copyrighted characters or adult content.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Curating larger and more varied preference datasets could reduce repetition in open-ended generation tasks.
The same mechanism may produce repeated tropes in other creative outputs such as poetry or role-play dialogue.
Controlled experiments that swap preference datasets between models would isolate the contribution of alignment data.
Monitoring for similar low-diversity clusters could help identify unintended effects of alignment in non-story domains.

Load-bearing premise

The five prompts and four models produce outputs representative of typical LLM story generation, and the identified words are verifiably rare in pre-training data while common in the relevant preference datasets.

What would settle it

Counting the eleven words in the actual preference datasets used for these models and finding them absent or rare, or training a model on preference data that excludes those words and observing repetition rates below 20 percent.
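Both halves of this test start from the same bookkeeping: count how many preference-data texts contain each word, then build a filtered copy with those texts removed for an ablation run. A minimal sketch follows, assuming exact lowercase token matches, the six words named on this page (the other five are not listed), and invented example texts. The function names are hypothetical.

```python
import re
from collections import Counter

# Placeholder list: only the six words named on this page.
TARGET_WORDS = ("elias", "mara", "elara", "lighthouse", "clockmaker", "librarian")

WORD_RE = re.compile(r"[a-z]+")


def document_counts(texts, targets=TARGET_WORDS):
    """Count, for each target word, the number of texts that contain it."""
    target_set = set(targets)
    counts = Counter({word: 0 for word in targets})
    for text in texts:
        present = set(WORD_RE.findall(text.lower())) & target_set
        counts.update(present)  # adds 1 per word present in this text
    return counts


def without_targets(texts, targets=TARGET_WORDS):
    """Keep only texts with no target word (candidate ablation data)."""
    target_set = set(targets)
    return [
        text for text in texts
        if not target_set & set(WORD_RE.findall(text.lower()))
    ]


# Toy preference texts invented for illustration.
preference_texts = [
    "Mara kept the lighthouse lit.",
    "The librarian shelved the atlas.",
    "Rain fell on the harbour all week.",
]
counts = document_counts(preference_texts)
filtered = without_targets(preference_texts)
print(counts)
print(filtered)
```

The 20 percent threshold would then be checked by re-running a story-level coverage count on outputs from a model trained on the filtered set.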

Figures

Figures reproduced from arXiv: 2605.26492 by David Mimno, Sil Hamilton.

**Figure 2.** Written by Gemini 3.1 Flash-Lite when prompted to “write a story.” Lighthouses are present in half of all 20,000 stories generated for this experiment.

**Figure 3.** t-SNE of a topic model over all stories in OLMo 3’s post-training set (left), and Core stories (right).

Original abstract

LLM-generated stories are a popular use case, but they show very low variability. We sample 20,000 total stories from four current models using five prompts. We find that 11 words occur in 88.3% of generated stories, with little difference between models. These words include names (Elias, Mara, Elara), settings (lighthouses), and professions (clockmaker, librarian). These tokens do not often occur in published literature nor pre-training data, but they are found in preference data that is likely to have been used by all current models. Surprisingly, these "lighthouse" stories are infrequent when compared with the average post-training story, much of which contains references to copyrighted characters or adult content. This result demonstrates the potentially disproportionate impact of small datasets combined with powerful alignment algorithms.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper documents a narrow set of 11 tokens appearing in 88.3% of stories from four models and attributes it to preference data, but the supporting counts and controls are not shown.


The core observation is that 11 words turn up in 88.3% of 20,000 sampled stories, with almost no variation across the four models tested. The authors name the words (Elias, Mara, Elara, lighthouse, clockmaker, librarian and a few others) and note they are uncommon in published fiction and pre-training data but appear in preference datasets.

What the work does cleanly is run a simple frequency count on a reasonably sized sample and show the pattern is stable. That part is straightforward and worth recording.

The soft spots are in the causal step. The abstract states the tokens are rare in pre-training and literature without giving token counts, corpus sizes, or the actual datasets checked. It also uses only five prompts; nothing in the provided text shows these prompts were chosen to match typical user story requests or that other prompt styles produce different results. The claim that the pattern comes from "small datasets combined with powerful alignment algorithms" therefore rests on the frequency observation plus an untested assumption about data sources.

The paper is useful for people who build or tune story-generation systems and want a concrete example of how post-training can collapse output variety. It is less useful for anyone who needs the rarity claims verified or who wants to know whether the effect survives different prompts or larger samples.

It is worth sending to referees. The empirical pattern is clear enough that a reviewer can check the missing counts and prompt details without starting from scratch.

Referee Report

2 major / 0 minor

Summary. The paper claims that LLM-generated stories exhibit strikingly low diversity: across 20,000 stories sampled from four current models using five prompts, 11 specific tokens (names such as Elias, Mara, Elara; the setting 'lighthouse'; professions such as clockmaker and librarian) appear in 88.3% of outputs, with little model-to-model variation. The authors argue these tokens are rare in published literature and pre-training corpora yet common in preference data used for alignment, and conclude that small preference datasets combined with powerful alignment algorithms can produce disproportionate effects on creative output.

Significance. If the frequency counts and the attribution to preference data hold after verification, the result would be significant for post-training research: it supplies concrete evidence that alignment can systematically suppress lexical diversity even when the underlying base models differ. The 20,000-story sample size is a clear strength, providing reliable empirical support for the reported 88.3% figure and enabling direct comparison across models.
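How much the sample size buys can be made concrete with a standard interval for a proportion. The sketch below applies the normal-approximation (Wald) interval to the reported 88.3% and 20,000 stories. It assumes every story is an independent draw, which is unlikely here because stories are clustered by model and prompt, so the resulting interval of roughly plus or minus 0.45 percentage points should be read as optimistic. The function name `wald_interval` is hypothetical.

```python
import math


def wald_interval(p_hat, n, z=1.96):
    """Normal-approximation confidence interval for a proportion.

    Assumes n independent observations; z=1.96 gives a nominal 95% interval.
    """
    standard_error = math.sqrt(p_hat * (1.0 - p_hat) / n)
    return p_hat - z * standard_error, p_hat + z * standard_error


# Figures taken from the abstract: 88.3% of 20,000 stories.
low, high = wald_interval(0.883, 20_000)
print(f"{low:.4f} to {high:.4f}")
```

A cluster-aware analysis, for example computing the proportion separately for each model and prompt pair, would give a more honest picture of the uncertainty than this pooled interval.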

major comments (2)

[Abstract] Abstract: the central causal claim—that the 11 tokens 'do not often occur in published literature nor pre-training data, but they are found in preference data'—is asserted without any frequency counts, dataset citations, or verification procedure. This assertion is load-bearing for the conclusion that the phenomenon results from 'small datasets combined with powerful alignment algorithms'; absent the supporting data, the link between observation and explanation remains unestablished.
[Abstract] Abstract / implied methods: the five prompts and four models are presented as representative of 'typical LLM story generation,' yet no selection criteria, prompt wording, model versions, or sampling protocol are supplied. If the prompts are atypical or the models share undisclosed training overlap, the 88.3% dominance cannot be generalized, undermining the claim that the pattern is a general feature of current LLMs.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback, which highlights areas where additional empirical support and methodological transparency are needed. We address each major comment below and commit to revisions that strengthen the manuscript without altering its core findings.


Referee: [Abstract] Abstract: the central causal claim—that the 11 tokens 'do not often occur in published literature nor pre-training data, but they are found in preference data'—is asserted without any frequency counts, dataset citations, or verification procedure. This assertion is load-bearing for the conclusion that the phenomenon results from 'small datasets combined with powerful alignment algorithms'; absent the supporting data, the link between observation and explanation remains unestablished.

Authors: We agree that the abstract presents this attribution without accompanying quantitative evidence or citations, leaving the causal link to preference data insufficiently supported. The full manuscript contains qualitative discussion of token rarity but lacks the explicit frequency tables, dataset references, and verification steps required to substantiate the claim. In the revision we will add a concise methods subsection (and update the abstract) that reports relative frequencies drawn from samples of published literature (e.g., Project Gutenberg), pre-training corpora (e.g., The Pile), and publicly documented preference datasets, together with the exact procedure used to identify the 11 tokens in those sources. This will make the link between observation and explanation verifiable. revision: yes
Referee: [Abstract] Abstract / implied methods: the five prompts and four models are presented as representative of 'typical LLM story generation,' yet no selection criteria, prompt wording, model versions, or sampling protocol are supplied. If the prompts are atypical or the models share undisclosed training overlap, the 88.3% dominance cannot be generalized, undermining the claim that the pattern is a general feature of current LLMs.

Authors: We concur that the abstract and methods description omit the information needed to evaluate representativeness. The manuscript identifies four current models and five prompts but supplies none of the precise wording, version numbers, selection rationale, or sampling hyperparameters. In the revised version we will expand the methods section to list the exact prompt texts, model identifiers and release versions, temperature/top-p settings, and the criteria used to select these models and prompts as representative of frontier LLM story-generation practice. This addition will allow readers to assess potential overlap and generalizability directly. revision: yes
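The relative frequencies promised in the first response are usually reported per million tokens, so that corpora of very different sizes can be compared. The sketch below shows that normalization on toy inputs. The corpus names and texts are invented placeholders rather than samples from Project Gutenberg, The Pile, or any preference dataset, and the helper `per_million` is hypothetical.

```python
import re
from collections import Counter

WORD_RE = re.compile(r"[a-z]+")


def per_million(texts, targets):
    """Occurrences of each target word per million tokens in a corpus."""
    tokens = [tok for text in texts for tok in WORD_RE.findall(text.lower())]
    total = len(tokens)
    counts = Counter(tok for tok in tokens if tok in targets)
    if total == 0:
        return {word: 0.0 for word in targets}
    return {word: counts[word] * 1_000_000 / total for word in targets}


# Two of the words named in the abstract, used here as an example.
targets = {"elias", "lighthouse"}

# Toy corpora invented for illustration.
corpora = {
    "published_fiction": ["The captain read the letter twice before he spoke."],
    "preference_data": ["Elias trimmed the lamp while the lighthouse shook in the wind."],
}
rates = {name: per_million(texts, targets) for name, texts in corpora.items()}
for name, rate in rates.items():
    print(name, rate)
```

Reporting the token total alongside each rate would let readers judge whether a low count reflects genuine rarity or simply a small sample.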

Circularity Check

0 steps flagged

Empirical frequency counts from sampled outputs; no derivation or self-referential reduction


The paper reports direct empirical observations: sampling 20,000 stories from four models with five prompts, then counting token frequencies (11 words in 88.3% of outputs). No equations, fitted parameters, predictions derived from inputs, or load-bearing self-citations appear in the provided text. The claim that the tokens are rare in pre-training data but present in preference data is asserted without reduction to the paper's own fitted quantities or prior self-citations; it is an external empirical assertion (even if verification details are limited). The central result is therefore self-contained as a frequency measurement and does not reduce to its inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Purely empirical frequency study with no mathematical derivations, free parameters, axioms, or postulated entities.



Reference graph

Works this paper leans on

6 extracted references · 3 canonical work pages · 1 internal anchor

[1]

John Joon Young Chung, Vishakh Padmakumar, Melissa Roemmele, Yuqian Sun, and Max Kreminski

Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan):993–1022. John Joon Young Chung, Vishakh Padmakumar, Melissa Roemmele, Yuqian Sun, and Max Kreminski. 2025. Modifying Large Language Model Post-Training for Diverse Creative Writing. Anil R. Doshi and Oliver P. Hauser. 2024. Generative AI enhances individual creativity but reduces th...

[2]

Robert Kirk, Ishita Mediratta, Christoforos Nalmpantis, Jelena Luketina, Eric Hambro, Edward Grefenstette, and Roberta Raileanu

Optimal detection of changepoints with a linear computational cost. Journal of the American Statistical Association, 107(500):1590–1598. Robert Kirk, Ishita Mediratta, Christoforos Nalmpantis, Jelena Luketina, Eric Hambro, Edward Grefenstette, and Roberta Raileanu. 2024. Understanding the Effects of RLHF on LLM Generalisation and Diversity. In ICLR 2024...

2024
[3]

Training language models to follow instructions with human feedback

Multi-Novelty: Improve the Diversity and Novelty of Contents Generated by Large Language Models via inference-time Multi-Views Brainstorming. Franco Moretti. 2000. The Slaughterhouse of Literature. Modern Language Quarterly, 61(1):207–228. Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarw...

[4]

Who are the characters in the text?
[5]

What is the character's role in the text?
[6]

character_names

What is the setting? Return JSON only in this exact schema: {{ "character_names": ["first name only", ...], "settings": ["place or location noun phrase", ...], "professions": ["profession or stable role noun phrase", ...] }} Additional rules: - For `character_names`, include only first names for named human or human-like characters. - For `profe...

