Dark & Stormy: Modeling Humor in Sentences from the Bulwer-Lytton Fiction Contest
Pith reviewed 2026-05-18 03:06 UTC · model grok-4.3
The pith
Sentences from the Bulwer-Lytton contest show that standard humor detectors fail on bad writing and that language models exaggerate literary devices when copying the style.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The Bulwer-Lytton corpus mixes elements already studied in humor research with less common literary devices, standard detection models perform poorly on it, and prompted language models imitate the contest style by over-using certain devices while producing more novel adjective-noun bigrams than the original human entries.
What carries the argument
The Bulwer-Lytton corpus together with annotations for literary devices and counts of novel adjective-noun bigrams, used to compare human contest entries against model-generated imitations.
If this is right
- Humor detection systems need training examples that include intentionally bad writing to become more complete.
- Language models can be guided to reduce overuse of specific literary devices when asked to create humorous text.
- Novelty measures on word pairs offer one way to quantify how generated humor differs from human attempts.
- Corpora of deliberately poor prose can serve as a test bed for studying the boundaries of computational humor.
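The word-pair novelty measure mentioned in the list above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes sentences are already part-of-speech tagged with `ADJ` and `NOUN` labels, and that `reference_counts` is a lookup table of bigram counts from some reference corpus. The function names, the `max_count` threshold, and the toy data are all hypothetical.

```python
from collections import Counter


def adjective_noun_bigrams(tagged_sentence):
    """Return lowercase (adjective, noun) pairs from adjacent tokens."""
    pairs = []
    for (w1, t1), (w2, t2) in zip(tagged_sentence, tagged_sentence[1:]):
        if t1 == "ADJ" and t2 == "NOUN":
            pairs.append((w1.lower(), w2.lower()))
    return pairs


def novel_bigrams_per_sentence(tagged_sentences, reference_counts, max_count=0):
    """Mean number of adjective-noun bigrams per sentence whose
    reference-corpus count is at most max_count (assumed threshold)."""
    if not tagged_sentences:
        return 0.0
    novel = 0
    for sentence in tagged_sentences:
        for pair in adjective_noun_bigrams(sentence):
            if reference_counts.get(pair, 0) <= max_count:
                novel += 1
    return novel / len(tagged_sentences)


# Toy reference counts and pre-tagged sentences (illustrative only).
reference_counts = Counter({("dark", "night"): 120, ("stormy", "night"): 45})
human = [[("It", "PRON"), ("was", "AUX"), ("a", "DET"),
          ("dark", "ADJ"), ("night", "NOUN")]]
generated = [[("It", "PRON"), ("was", "AUX"), ("a", "DET"),
              ("gelatinous", "ADJ"), ("night", "NOUN")]]

human_rate = novel_bigrams_per_sentence(human, reference_counts)
generated_rate = novel_bigrams_per_sentence(generated, reference_counts)
print(human_rate, generated_rate)
```

The choice of reference corpus and threshold determines what counts as "novel", so any comparison between human and generated text should hold both fixed.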
Where Pith is reading between the lines
- The gap between human and model outputs suggests that current generation methods lack fine control over subtlety in creative language.
- Adding contest-style bad humor to evaluation sets could expose weaknesses in alignment techniques for creative tasks.
- Similar device-overuse patterns might appear in other domains where models are asked to imitate distinctive human styles.
Load-bearing premise
The chosen literary-device labels and bigram-novelty counts actually measure what makes the humor bad rather than reflecting only the contest rules or the annotation method itself.
What would settle it
A humor detector that reaches high accuracy on the Bulwer-Lytton sentences using only training data from existing humor collections, or a language model prompt that produces bigram novelty rates matching the human contest entries, would contradict the reported performance gaps and generation differences.
Original abstract
Textual humor is enormously diverse and computational studies need to account for this range, including intentionally bad humor. In this paper, we curate and analyze a novel corpus of sentences from the Bulwer-Lytton Fiction Contest to better understand "bad" humor in English. Standard humor detection models perform poorly on our corpus, and an analysis of literary devices finds that these sentences combine features common in existing humor datasets (e.g., puns, irony) with metaphor, metafiction and simile. LLMs prompted to synthesize contest-style sentences imitate the form but exaggerate the effect by over-using certain literary devices, and including far more novel adjective-noun bigrams than human writers. Data, code and analysis are available at https://github.com/venkatasg/bulwer-lytton
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper curates and releases a corpus of intentionally bad humorous sentences from the Bulwer-Lytton Fiction Contest. It shows that standard humor detection models achieve low performance on this data, analyzes the distribution of literary devices (puns, irony, metaphor, metafiction, simile) relative to existing humor corpora, and demonstrates that LLMs prompted to generate contest-style sentences over-use selected devices while producing substantially higher rates of novel adjective-noun bigrams than the human-authored examples.
Significance. If the quantitative comparisons hold, the work supplies a reproducible resource for studying the full spectrum of humor, including its deliberately poor variants, and documents concrete limitations of current detection and generation systems. The public release of data, code, and analysis scripts is a clear strength that enables direct follow-up experiments.
major comments (2)
- [§4] §4 (Literary Device Analysis): the reported frequencies of metaphor, metafiction, and simile are presented as distinguishing the Bulwer-Lytton sentences from prior humor datasets, yet no inter-annotator agreement statistics or explicit annotation guidelines are supplied. Without these, it is impossible to determine whether the observed differences are robust or sensitive to annotator interpretation, directly affecting the claim that the corpus combines familiar and novel device profiles.
- [§5] §5 (LLM Synthesis and Bigrams): the claim that LLM-generated sentences contain 'far more novel adjective-noun bigrams' depends on an unspecified reference corpus for the novelty baseline. If the reference is drawn from general rather than literary-fiction text, the elevated rate may be an artifact of the contest's purple-prose format rather than a general property of LLM exaggeration of bad humor.
minor comments (2)
- [Abstract / §3] The abstract and §3 should state the exact corpus size, selection criteria for the analyzed subset, and any pre-specification of the device taxonomy before annotation began.
- [Figures/Tables in §4–5] Figure or table captions for the device-frequency and bigram-novelty comparisons should include the precise statistical test and correction method used to establish significance.
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which help strengthen the clarity and reproducibility of our work. We address each major comment below and indicate the revisions planned for the next version of the manuscript.
Point-by-point responses
- Referee: [§4] §4 (Literary Device Analysis): the reported frequencies of metaphor, metafiction, and simile are presented as distinguishing the Bulwer-Lytton sentences from prior humor datasets, yet no inter-annotator agreement statistics or explicit annotation guidelines are supplied. Without these, it is impossible to determine whether the observed differences are robust or sensitive to annotator interpretation, directly affecting the claim that the corpus combines familiar and novel device profiles.
Authors: We agree that explicit annotation guidelines and inter-annotator agreement statistics improve the robustness of the literary device analysis. In the revised manuscript we will add the full annotation guidelines to an appendix. The device labels were produced by the first author following those guidelines, with the second author independently labeling a 20% subset for verification; we will report the resulting agreement statistics (Cohen's kappa) in the revised §4. These additions directly address concerns about annotator sensitivity while preserving the observed distributional differences between the Bulwer-Lytton corpus and prior humor datasets. revision: yes
- Referee: [§5] §5 (LLM Synthesis and Bigrams): the claim that LLM-generated sentences contain 'far more novel adjective-noun bigrams' depends on an unspecified reference corpus for the novelty baseline. If the reference is drawn from general rather than literary-fiction text, the elevated rate may be an artifact of the contest's purple-prose format rather than a general property of LLM exaggeration of bad humor.
Authors: We will clarify in the revised §5 that novelty is defined relative to the human-authored Bulwer-Lytton sentences themselves (i.e., bigrams appearing in LLM outputs but absent from the human contest entries). This within-style comparison is intentional: it isolates whether LLMs exaggerate the contest's characteristic overuse of unusual adjective-noun combinations beyond what human writers produce in the same purple-prose format. We will also add a brief discussion acknowledging that the contest style itself favors rare bigrams, yet the substantially higher rate in the LLM outputs still indicates an exaggeration effect specific to the generation process. revision: yes
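The agreement statistic named in the first response, Cohen's kappa, compares observed agreement between two annotators with the agreement expected by chance. The sketch below is a plain-Python illustration under stated assumptions: each sentence receives exactly one device label per annotator (a multi-label scheme would need one binary kappa per device), and the function name and toy labels are hypothetical rather than taken from the paper.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label proportions.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


# Toy labels for four sentences (illustrative only).
first_author = ["simile", "simile", "pun", "pun"]
second_author = ["simile", "pun", "pun", "pun"]
kappa = cohens_kappa(first_author, second_author)
print(kappa)
```

In this toy case the annotators agree on three of four items, but half of that agreement is expected by chance, which is why kappa is lower than raw agreement.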
Circularity Check
No significant circularity: empirical measurements on external corpus
Full rationale
The paper's central claims rest on curating an external Bulwer-Lytton corpus, evaluating off-the-shelf humor detection models, manually annotating literary devices, and counting novel adjective-noun bigrams in human vs. LLM-generated text. These are direct observational comparisons against independent baselines and existing datasets. No equations, fitted parameters presented as predictions, self-citations for uniqueness theorems, or internal definitions that loop back to the target quantities appear in the provided text. The analysis is self-contained: it can be checked against external benchmarks and replicated using the released data and code.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Literary devices such as metaphor, simile, and metafiction can be reliably annotated in short sentences by human raters.
- domain assumption: Standard humor detection models trained on existing datasets represent the current state of the art for general humor recognition.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Standard humor detection models perform poorly on our corpus, and an analysis of literary devices finds that these sentences combine features common in existing humor datasets (e.g., puns, irony) with metaphor, metafiction and simile.
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
We extract all ANs from all our datasets ... and query their count in the DCLM pre-training corpus ... BL sentences have many more novel ANs per sentence than either of our baseline datasets
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.