
arxiv: 2510.24538 · v2 · submitted 2025-10-28 · 💻 cs.CL

Dark & Stormy: Modeling Humor in Sentences from the Bulwer-Lytton Fiction Contest

Pith reviewed 2026-05-18 03:06 UTC · model grok-4.3

classification 💻 cs.CL
keywords humor detection · Bulwer-Lytton contest · bad humor · literary devices · large language models · text generation · corpus analysis · satirical writing

The pith

Sentences from the Bulwer-Lytton contest show that standard humor detectors fail on bad writing and that language models exaggerate literary devices when copying the style.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper assembles a collection of deliberately awkward sentences from the Bulwer-Lytton Fiction Contest to examine intentionally poor humor. It finds that existing humor detection systems perform poorly on this material. The sentences blend familiar humor markers such as puns and irony with additional devices including metaphor, simile, and metafiction. When large language models are prompted to produce similar sentences, they use the same devices more often than human authors do and generate a higher share of unusual adjective-noun pairs.

Core claim

The Bulwer-Lytton corpus mixes elements already studied in humor research with less common literary devices, standard detection models perform poorly on it, and prompted language models imitate the contest style by over-using certain devices while producing more novel adjective-noun bigrams than the original human entries.

What carries the argument

The Bulwer-Lytton corpus together with annotations for literary devices and counts of novel adjective-noun bigrams, used to compare human contest entries against model-generated imitations.
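The device comparison described above can be sketched as a per-device rate gap between two annotated sets. This is a minimal illustration, not the paper's code: the function names `device_rates` and `rate_gap`, the one-label-set-per-sentence input format, and the toy labels are all assumptions made for the example.

```python
from collections import Counter
from typing import Dict, List, Set


def device_rates(annotations: List[Set[str]]) -> Dict[str, float]:
    """Fraction of sentences that carry each device label."""
    n = len(annotations)
    if n == 0:
        return {}
    counts = Counter(d for labels in annotations for d in labels)
    return {d: c / n for d, c in counts.items()}


def rate_gap(human: List[Set[str]], model: List[Set[str]]) -> Dict[str, float]:
    """Model rate minus human rate per device; positive means model over-use."""
    h, m = device_rates(human), device_rates(model)
    return {d: m.get(d, 0.0) - h.get(d, 0.0) for d in sorted(set(h) | set(m))}


# Toy, hypothetical annotations (one set of device labels per sentence).
human_labels = [{"simile"}, {"pun"}, set(), {"simile", "irony"}]
model_labels = [{"simile"}, {"simile", "metaphor"}, {"simile"}, {"pun"}]

for device, gap in rate_gap(human_labels, model_labels).items():
    print(f"{device}: {gap:+.2f}")
```

A positive gap for a device means it appears in a larger share of model sentences than human ones. Whether such a gap is meaningful would still depend on sample size and on a significance test, which this sketch does not attempt.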

If this is right

  • Humor detection systems need training examples that include intentionally bad writing to become more complete.
  • Language models can be guided to reduce overuse of specific literary devices when asked to create humorous text.
  • Novelty measures on word pairs offer one way to quantify how generated humor differs from human attempts.
  • Corpora of deliberately poor prose can serve as a test bed for studying the boundaries of computational humor.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The gap between human and model outputs suggests that current generation methods lack fine control over subtlety in creative language.
  • Adding contest-style bad humor to evaluation sets could expose weaknesses in alignment techniques for creative tasks.
  • Similar device-overuse patterns might appear in other domains where models are asked to imitate distinctive human styles.

Load-bearing premise

The chosen literary-device labels and bigram-novelty counts actually measure what makes the humor bad rather than reflecting only the contest rules or the annotation method itself.

What would settle it

A humor detector that reaches high accuracy on the Bulwer-Lytton sentences using only training data from existing humor collections, or a language model prompt that produces bigram novelty rates matching the human contest entries, would contradict the reported performance gaps and generation differences.

Original abstract

Textual humor is enormously diverse and computational studies need to account for this range, including intentionally bad humor. In this paper, we curate and analyze a novel corpus of sentences from the Bulwer-Lytton Fiction Contest to better understand "bad" humor in English. Standard humor detection models perform poorly on our corpus, and an analysis of literary devices finds that these sentences combine features common in existing humor datasets (e.g., puns, irony) with metaphor, metafiction and simile. LLMs prompted to synthesize contest-style sentences imitate the form but exaggerate the effect by over-using certain literary devices, and including far more novel adjective-noun bigrams than human writers. Data, code and analysis are available at https://github.com/venkatasg/bulwer-lytton

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper curates and releases a corpus of intentionally bad humorous sentences from the Bulwer-Lytton Fiction Contest. It shows that standard humor detection models achieve low performance on this data, analyzes the distribution of literary devices (puns, irony, metaphor, metafiction, simile) relative to existing humor corpora, and demonstrates that LLMs prompted to generate contest-style sentences over-use selected devices while producing substantially higher rates of novel adjective-noun bigrams than the human-authored examples.

Significance. If the quantitative comparisons hold, the work supplies a reproducible resource for studying the full spectrum of humor, including its deliberately poor variants, and documents concrete limitations of current detection and generation systems. The public release of data, code, and analysis scripts is a clear strength that enables direct follow-up experiments.

major comments (2)
  1. [§4] §4 (Literary Device Analysis): the reported frequencies of metaphor, metafiction, and simile are presented as distinguishing the Bulwer-Lytton sentences from prior humor datasets, yet no inter-annotator agreement statistics or explicit annotation guidelines are supplied. Without these, it is impossible to determine whether the observed differences are robust or sensitive to annotator interpretation, directly affecting the claim that the corpus combines familiar and novel device profiles.
  2. [§5] §5 (LLM Synthesis and Bigrams): the claim that LLM-generated sentences contain 'far more novel adjective-noun bigrams' depends on an unspecified reference corpus for the novelty baseline. If the reference is drawn from general rather than literary-fiction text, the elevated rate may be an artifact of the contest's purple-prose format rather than a general property of LLM exaggeration of bad humor.
minor comments (2)
  1. [Abstract / §3] The abstract and §3 should state the exact corpus size, selection criteria for the analyzed subset, and any pre-specification of the device taxonomy before annotation began.
  2. [Figures/Tables in §4–5] Figure or table captions for the device-frequency and bigram-novelty comparisons should include the precise statistical test and correction method used to establish significance.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which help strengthen the clarity and reproducibility of our work. We address each major comment below and indicate the revisions planned for the next version of the manuscript.

Point-by-point responses
  1. Referee: [§4] §4 (Literary Device Analysis): the reported frequencies of metaphor, metafiction, and simile are presented as distinguishing the Bulwer-Lytton sentences from prior humor datasets, yet no inter-annotator agreement statistics or explicit annotation guidelines are supplied. Without these, it is impossible to determine whether the observed differences are robust or sensitive to annotator interpretation, directly affecting the claim that the corpus combines familiar and novel device profiles.

    Authors: We agree that explicit annotation guidelines and inter-annotator agreement statistics improve the robustness of the literary device analysis. In the revised manuscript we will add the full annotation guidelines to an appendix. The device labels were produced by the first author following those guidelines, with the second author independently labeling a 20% subset for verification; we will report the resulting agreement statistics (Cohen's kappa) in the revised §4. These additions directly address concerns about annotator sensitivity while preserving the observed distributional differences between the Bulwer-Lytton corpus and prior humor datasets. revision: yes

  2. Referee: [§5] §5 (LLM Synthesis and Bigrams): the claim that LLM-generated sentences contain 'far more novel adjective-noun bigrams' depends on an unspecified reference corpus for the novelty baseline. If the reference is drawn from general rather than literary-fiction text, the elevated rate may be an artifact of the contest's purple-prose format rather than a general property of LLM exaggeration of bad humor.

    Authors: We will clarify in the revised §5 that novelty is defined relative to the human-authored Bulwer-Lytton sentences themselves (i.e., bigrams appearing in LLM outputs but absent from the human contest entries). This within-style comparison is intentional: it isolates whether LLMs exaggerate the contest's characteristic overuse of unusual adjective-noun combinations beyond what human writers produce in the same purple-prose format. We will also add a brief discussion acknowledging that the contest style itself favors rare bigrams, yet the substantially higher rate in the LLM outputs still indicates an exaggeration effect specific to the generation process. revision: yes
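The novelty definition given in the simulated response to comment 2 (adjective-noun bigrams present in LLM outputs but absent from the human contest entries) can be sketched as follows. This is a rough illustration under stated assumptions: sentences arrive already tokenized and tagged with coarse `ADJ`/`NOUN` labels, novelty is counted over distinct bigram types rather than tokens, and the names `adj_noun_bigrams` and `novelty_rate` are hypothetical. The paper's actual tagger and counting scheme are not specified in this page.

```python
from typing import Iterable, List, Set, Tuple

Tagged = List[Tuple[str, str]]  # one sentence as (token, coarse POS tag) pairs
Bigram = Tuple[str, str]


def adj_noun_bigrams(sentence: Tagged) -> List[Bigram]:
    """Lowercased (adjective, noun) pairs for adjacent ADJ NOUN tokens."""
    pairs = []
    for (w1, t1), (w2, t2) in zip(sentence, sentence[1:]):
        if t1 == "ADJ" and t2 == "NOUN":
            pairs.append((w1.lower(), w2.lower()))
    return pairs


def novelty_rate(candidates: Iterable[Tagged], reference: Iterable[Tagged]) -> float:
    """Share of distinct candidate bigrams that never occur in the reference."""
    ref: Set[Bigram] = {b for s in reference for b in adj_noun_bigrams(s)}
    cand: Set[Bigram] = {b for s in candidates for b in adj_noun_bigrams(s)}
    if not cand:
        return 0.0  # assumption: no bigrams means nothing novel
    return len(cand - ref) / len(cand)


# Toy, hypothetical tagged sentences.
human = [
    [("a", "DET"), ("dark", "ADJ"), ("night", "NOUN")],
    [("the", "DET"), ("stormy", "ADJ"), ("sea", "NOUN")],
]
model = [
    [("a", "DET"), ("dark", "ADJ"), ("night", "NOUN"),
     ("of", "ADP"), ("soggy", "ADJ"), ("regret", "NOUN")],
]

print(novelty_rate(model, human))
```

Swapping the `reference` argument for a larger general or literary-fiction corpus would give the alternative baseline the referee raises, so the same function can express both readings of "novel".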

Circularity Check

0 steps flagged

No significant circularity: empirical measurements on external corpus

Full rationale

The paper's central claims rest on curating an external Bulwer-Lytton corpus, evaluating off-the-shelf humor detection models, manually annotating literary devices, and counting novel adjective-noun bigrams in human vs. LLM-generated text. These are direct observational comparisons against independent baselines and existing datasets. No equations, fitted parameters presented as predictions, self-citations for uniqueness theorems, or internal definitions that loop back to the target quantities appear in the provided text. The analysis is self-contained against external benchmarks and can be replicated with the released data and code.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The work relies on standard assumptions about what counts as humor and how literary devices are identified; no new physical or mathematical entities are introduced.

axioms (2)
  • domain assumption Literary devices such as metaphor, simile, and metafiction can be reliably annotated in short sentences by human raters.
    The analysis of device usage depends on this annotation step being consistent.
  • domain assumption Standard humor detection models trained on existing datasets represent the current state of the art for general humor recognition.
    The claim that models perform poorly uses these as baselines.
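The first axiom above, that devices can be reliably annotated, is usually checked with a chance-corrected agreement statistic; the simulated rebuttal names Cohen's kappa. A minimal from-scratch sketch for two annotators follows. The function name `cohens_kappa`, the toy labels, and the convention of returning 1.0 when expected agreement is already 1 are assumptions for illustration, not details from the paper.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two annotators labeling the same items."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("label sequences must be non-empty and equal length")
    n = len(a)
    # Observed agreement: share of items with identical labels.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Expected agreement under independence, from each annotator's marginals.
    ca, cb = Counter(a), Counter(b)
    expected = sum(ca[k] * cb[k] for k in ca) / (n * n)
    if expected == 1.0:
        return 1.0  # assumed convention: both annotators used one shared label
    return (observed - expected) / (1.0 - expected)


# Toy, hypothetical binary labels: does the sentence contain a simile?
first_author = [1, 1, 0, 0]
second_author = [1, 0, 0, 0]
print(cohens_kappa(first_author, second_author))
```

For a multi-label scheme such as the device taxonomy discussed here, one option is to compute kappa separately per device as a binary present/absent decision, though the page does not say how the authors plan to aggregate.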

pith-pipeline@v0.9.0 · 5665 in / 1320 out tokens · 25072 ms · 2026-05-18T03:06:02.236841+00:00 · methodology

