
arxiv: 2604.09854 · v1 · submitted 2026-04-10 · 💻 cs.CL

Spoiler Alert: Narrative Forecasting as a Metric for Tension in LLM Storytelling

Pith reviewed 2026-05-10 16:46 UTC · model grok-4.3

classification 💻 cs.CL
keywords narrative tension · story generation · LLM evaluation · ending prediction · narrative structure · creative writing

The pith

The 100-Endings metric uses LLM ending prediction mismatches to measure and improve narrative tension.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

LLMs struggle to generate compelling stories and also fail to recognize the shortfall: on EQ-Bench, LLM judges prefer AI outputs over New Yorker stories. The paper introduces the 100-Endings metric, which captures narrative tension by measuring how often an LLM's 100 ending predictions from each story prefix fail to match the actual ending. This metric, along with an inflection rate derived from its sentence-level curve, correctly ranks human literary stories above typical LLM ones. The authors show that a constrained generation pipeline using templates and scaffolding can raise this tension measure while preserving scores on the EQ-Bench leaderboard.

Core claim

The 100-Endings metric walks a story sentence by sentence; at each point an LLM predicts the ending 100 times from the text so far, and tension is the rate of mismatches with the ground truth ending. The resulting curve also yields an inflection rate that counts direction reversals as a proxy for twists. Unlike rubric-based judges, this metric ranks New Yorker stories above LLM-generated ones. A generation pipeline using story templates, idea formulation, and narrative scaffolding raises the tension score without harming EQ-Bench performance.

What carries the argument

The 100-Endings metric: at each sentence, run 100 ending predictions from the prefix and count mismatches with the true ending.
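
As a rough illustration of that machinery, the sketch below walks a story sentence by sentence and computes one mismatch rate (the paper's "no-rate") per prefix. It is a minimal sketch, not the authors' implementation: `no_rate_curve`, `predict_ending` and `judge_matches` are hypothetical names, and the toy predictor and judge stand in for the LLM calls the paper uses.

```python
import random
from typing import Callable, List


def no_rate_curve(
    sentences: List[str],
    predict_ending: Callable[[str], str],       # hypothetical stand-in for the predictor LLM
    judge_matches: Callable[[str, str], bool],  # hypothetical stand-in for the judge LLM
    n_samples: int = 100,
) -> List[float]:
    """Return one mismatch rate per prefix position i = 1 .. N-1."""
    curve = []
    for i in range(1, len(sentences)):
        prefix = " ".join(sentences[:i])
        remainder = " ".join(sentences[i:])
        # Sample n_samples endings from the prefix and count how many the judge accepts.
        matches = sum(
            judge_matches(predict_ending(prefix), remainder)
            for _ in range(n_samples)
        )
        curve.append(1.0 - matches / n_samples)
    return curve


# Toy stand-ins (assumptions for illustration only): the "predictor" guesses one
# of two endings, and the "judge" checks whether the guess appears in the remainder.
rng = random.Random(0)
toy_predict = lambda prefix: rng.choice(["they reconcile", "they part"])
toy_judge = lambda guess, remainder: guess in remainder

story = ["They met in winter.", "They argued all spring.", "In the end they part."]
curve = no_rate_curve(story, toy_predict, toy_judge)
```

Passing the predictor and judge in as callables keeps the loop independent of any particular model or API, which is also what a held-out-predictor ablation would need.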

If this is right

  • The metric distinguishes high-tension literary stories from typical LLM outputs where rubrics do not.
  • Incorporating structural constraints during generation increases measured tension.
  • The sentence-level mismatch curve provides additional measures like inflection rate for analyzing narrative structure.
  • Improvements in tension can be achieved without sacrificing performance on existing creative writing benchmarks.
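
To make the inflection-rate idea concrete, here is a minimal Python sketch. The paper's exact definition is not reproduced on this page, so several choices are assumptions: a position counts as a reversal when it is a local peak or valley, "sharp" means the vertex angle at that point is below a threshold `alpha_deg`, the denominator is the total number of positions, and smoothing is omitted. Rescaling both axes to [0, 1] and the thresholds 30, 60 and 120 follow the appendix excerpt quoted in the reference graph. The name `inflection_rate` is hypothetical.

```python
import math
from typing import List


def inflection_rate(curve: List[float], alpha_deg: float = 60.0) -> float:
    """Fraction of positions where the curve reverses with a vertex angle below alpha_deg."""
    n = len(curve)
    if n < 3:
        return 0.0
    lo, hi = min(curve), max(curve)
    span = hi - lo
    if span == 0:
        return 0.0  # a flat curve has no reversals
    # Rescale both axes to [0, 1] using the story's own range.
    xs = [k / (n - 1) for k in range(n)]
    ys = [(v - lo) / span for v in curve]
    sharp = 0
    for k in range(1, n - 1):
        ax, ay = xs[k - 1] - xs[k], ys[k - 1] - ys[k]
        bx, by = xs[k + 1] - xs[k], ys[k + 1] - ys[k]
        # Both neighbours on the same vertical side means a peak or a valley.
        if ay * by <= 0:
            continue
        cos_angle = (ax * bx + ay * by) / (math.hypot(ax, ay) * math.hypot(bx, by))
        angle = math.degrees(math.acos(max(-1.0, min(1.0, cos_angle))))
        if angle < alpha_deg:  # assumption: smaller vertex angle = sharper reversal
            sharp += 1
    return sharp / n  # assumption: denominator is all positions
```

On a zigzag such as `[0, 1, 0, 1, 0]` every interior point is a sharp reversal, while a monotone curve scores zero; a high mean no-rate with a flat curve would therefore still yield a low inflection rate.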

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Using the same model family for prediction and generation may introduce bias into the tension scores.
  • The forecasting method could extend to measuring unpredictability in other narrative forms such as scripts or interactive stories.
  • Correlating the metric with direct human judgments of suspense would test its validity as a tension proxy.

Load-bearing premise

That the rate of ending-prediction mismatches by an LLM predictor faithfully captures the narratological concept of tension, rather than reflecting predictor biases or surface-level features.

What would settle it

Run a reader study where participants rate the tension or suspense in matched stories and compare those ratings to the 100-Endings scores; absence of positive correlation would falsify the metric's claim.

Figures

Figures reproduced from arXiv: 2604.09854 by Ari Holtzman, Hoyt Long, Peiqi Sui, Peter West, Richard Jean So, Tianyi Cheng, Yutong Zhu.

Figure 1: Pipeline architecture and effects. Left: The story generation pipeline creates a to-do list template with a reference template (Step 0), combines it with a target story idea into a thick intermediate plan (Step 1), and generates the final story (Step 2). Right: Pipeline effects on shape of tension curves as measured by the 100-Endings metric analysis.
Figure 2: Left: Illustration of the 100-Endings metric. At each sentence position, 100 candidate endings are generated and judged against the actual continuation; the fraction that diverges (no-rate) tracks narrative tension. Right: Example from Updike’s “Snowing in Greenwich Village” at 91.2% revealed (no-rate = 0.45). The matched ending preserves Updike’s restrained ambiguity; the diverged ending collapses the ten…
Figure 3: No-rate curves for two New Yorker stories illustrating how mean no-rate and inflection rate can diverge. Left: “Poor Girl” (Petrushevskaya), with very high mean no-rate (0.925) but a nearly flat curve (inflection rate 0.048). Right: “The Fellow” (Joy Williams), a surrealist story with low mean no-rate (0.403) but frequent sharp reversals (inflection rate 0.339). Red/blue triangles mark peaks/valleys with v…
Figure 4: Effect of our narrative pipeline on rubric scores and tension metrics (32 EQ-Bench prompts).
Figure 5: No-rate curve for Updike’s “Snowing in Greenwich Village.” Tension rises once the husband…
Figure 6: No-rate curves for two stories generated by Sonnet 4.6 from the same prompt.
Figure 7: Tension-related EQ-Bench sub-criteria across three human-written corpora (higher…
Figure 8: Overall (left) and late-stage (≥80%, right) mean no-rate by corpus. New Yorker fiction is the least predictable on both measures. Top-10 LLMs score lowest, below all human corpora, including amateurs, despite outscoring New Yorker on the EQ-Bench rubric (…
Figure 9: Inflection rate by corpus at three vertex-angle thresholds (…
Figure 10: Post-spike tension retention (%) by corpus. After a local peak in the no-rate curve, what…
Original abstract

LLMs have so far failed both to generate consistently compelling stories and to recognize this failure--on the leading creative-writing benchmark (EQ-Bench), LLM judges rank zero-shot AI stories above New Yorker short stories, a gold standard for literary fiction. We argue that existing rubrics overlook a key dimension of compelling human stories: narrative tension. We introduce the 100-Endings metric, which walks through a story sentence by sentence: at each position, a model predicts how the story will end 100 times given only the text so far, and we measure tension as how often predictions fail to match the ground truth. Beyond the mismatch rate, the sentence-level curve yields complementary statistics, such as inflection rate, a geometric measure of how frequently the curve reverses direction, tracking twists and revelations. Unlike rubric-based judges, 100-Endings correctly ranks New Yorker stories far above LLM outputs. Grounded in narratological principles, we design a story-generation pipeline using structural constraints, including analysis of story templates, idea formulation, and narrative scaffolding. Our pipeline significantly increases narrative tension as measured by the 100-Endings metric, while maintaining performance on the EQ-Bench leaderboard.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces the 100-Endings metric, which quantifies narrative tension by sampling 100 ending predictions from an LLM at each sentence of a story and measuring the mismatch rate with the ground-truth continuation; it also derives complementary statistics such as inflection rate. The authors argue that this metric, unlike rubric-based judges on EQ-Bench, correctly ranks New Yorker short stories above zero-shot LLM outputs. They further present a story-generation pipeline incorporating template analysis, idea formulation, and narrative scaffolding that raises 100-Endings tension scores while preserving EQ-Bench performance.

Significance. If the 100-Endings mismatch rate isolates structural suspense rather than predictor alignment, the metric offers a falsifiable, sentence-level alternative to subjective rubrics for creative-writing evaluation and could guide more effective LLM story generation. The pipeline's ability to improve the new metric without EQ-Bench regression is a practical strength, and the provision of inflection-rate statistics adds analytical depth. The work is grounded in narratological ideas but its impact hinges on rigorous validation against human tension judgments and controls for distributional confounds.

major comments (2)
  1. [Abstract and §3] Abstract and §3 (metric definition): the central claim that mismatch rate 'faithfully captures' narratological tension is load-bearing yet rests on the untested assumption that predictor failures are driven by suspense structure rather than training-distribution overlap. Because the 100-Endings predictor and the story generator are both LLMs, LLM-generated stories may simply be more in-distribution for the predictor, producing lower mismatch rates (apparently lower tension) than human-written stories even without any structural difference; an ablation with a held-out predictor family or correlation against human suspense ratings is required to support the ranking result.
  2. [§5] §5 (experiments): the reported ranking advantage of New Yorker stories and the pipeline's tension gains lack reported statistical tests, confidence intervals, or controls for story length, lexical diversity, and genre. Without these, it is unclear whether the 100-Endings advantage is robust or an artifact of surface features that also affect EQ-Bench.
minor comments (2)
  1. [§3] The sentence-level curve and inflection-rate definition would benefit from an explicit equation or pseudocode in §3 to allow exact reproduction.
  2. [§3] Clarify the exact LLM used for the 100-Endings predictor (model name, temperature, prompt template) so that the metric can be re-implemented without ambiguity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback, which highlights key areas for strengthening the validation and statistical presentation of the 100-Endings metric. We address each major comment below and will incorporate revisions to improve rigor without altering the core claims.

Point-by-point responses
  1. Referee: [Abstract and §3] Abstract and §3 (metric definition): the central claim that mismatch rate 'faithfully captures' narratological tension is load-bearing yet rests on the untested assumption that predictor failures are driven by suspense structure rather than training-distribution overlap. Because the 100-Endings predictor and the story generator are both LLMs, LLM-generated stories may simply be more in-distribution for the predictor, producing lower mismatch rates (apparently lower tension) than human-written stories even without any structural difference; an ablation with a held-out predictor family or correlation against human suspense ratings is required to support the ranking result.

    Authors: We acknowledge that the metric's interpretation assumes mismatches primarily reflect structural suspense rather than distributional overlap between the predictor and generated text. The consistent outperformance of New Yorker stories over zero-shot LLM outputs across experiments provides supporting evidence aligned with narratological expectations, but we agree this does not fully isolate the cause. In the revised manuscript, we will add a new analysis correlating 100-Endings scores with human-annotated tension ratings on a held-out sample of 20 stories (balanced between human and LLM sources). This directly tests alignment with perceived suspense. A full ablation using a disjoint model family is computationally intensive and beyond the current scope, but we will explicitly discuss the limitation of shared model families and its implications for future work. revision: partial

  2. Referee: [§5] §5 (experiments): the reported ranking advantage of New Yorker stories and the pipeline's tension gains lack reported statistical tests, confidence intervals, or controls for story length, lexical diversity, and genre. Without these, it is unclear whether the 100-Endings advantage is robust or an artifact of surface features that also affect EQ-Bench.

    Authors: We agree that additional statistical controls and reporting are necessary to establish robustness. The revised manuscript will include bootstrap-derived 95% confidence intervals for all reported mean 100-Endings mismatch rates and inflection rates. We will also add linear regression models that control for story length (sentence count) and lexical diversity (type-token ratio), testing whether group differences remain significant after these covariates. Genre is held constant as all stories are short fiction, which we will state explicitly along with a discussion of potential remaining confounds. These additions will clarify that the observed advantages are not reducible to the listed surface features. revision: yes
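
For readers unfamiliar with the promised intervals, the following is a minimal sketch of a percentile bootstrap for a group mean. The rebuttal says only "bootstrap-derived", so the percentile variant, the resample count and the name `bootstrap_mean_ci` are assumptions, and the per-story values are invented placeholders.

```python
import random
from typing import List, Tuple


def bootstrap_mean_ci(
    values: List[float],
    n_resamples: int = 2000,
    level: float = 0.95,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile-bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)
    n = len(values)
    # Resample the stories with replacement and record each resample's mean.
    means = sorted(
        sum(rng.choices(values, k=n)) / n for _ in range(n_resamples)
    )
    tail = (1.0 - level) / 2
    lo = means[int(tail * n_resamples)]
    hi = means[min(n_resamples - 1, int((1.0 - tail) * n_resamples))]
    return lo, hi


# Invented placeholder data: one mean no-rate per story in a corpus.
per_story_no_rate = [0.62, 0.71, 0.55, 0.80, 0.67, 0.74, 0.59, 0.69]
ci_low, ci_high = bootstrap_mean_ci(per_story_no_rate)
```

Resampling whole stories, rather than individual sentence positions, keeps the interval at the level at which the corpora are compared.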

Circularity Check

0 steps flagged

100-Endings metric defined from observable mismatches; no reduction to inputs by construction

full rationale

The 100-Endings metric is introduced directly as the rate at which an LLM predictor's 100 sampled endings fail to match the ground-truth continuation, yielding an observable mismatch frequency and inflection statistics. This construction does not reduce to fitted parameters, self-referential quantities, or author-specific ansatzes; the ranking of New Yorker stories above LLM outputs and the pipeline's reported tension increase are presented as empirical results measured against this independent definition. No load-bearing self-citations, uniqueness theorems, or renamings of known results appear in the derivation chain. The central claims remain self-contained against external benchmarks such as EQ-Bench, consistent with a minor or absent circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the domain assumption that prediction failure rate operationalizes narrative tension, with no free parameters or invented physical entities but a new methodological construct.

axioms (1)
  • domain assumption: Narrative tension is captured by the frequency with which an LLM fails to predict the actual story ending from preceding text.
    This definition underpins the entire 100-Endings metric and is invoked to justify its superiority over rubric-based judges.
invented entities (1)
  • 100-Endings metric (no independent evidence)
    purpose: Quantify narrative tension via repeated ending forecasts and mismatch statistics
    Newly introduced construct without external validation or independent evidence cited in the abstract.


Reference graph

Works this paper leans on

14 extracted references · 14 canonical work pages

  1. [1]

    doi: 10.48550/arxiv.2312.06281. Samuel J Paech. Eq-bench creative writing benchmark v3. https://github.com/EQ-bench/creative-writing-bench, 2025. Arjun Panickssery, Samuel R. Bowman, and Shi Feng. LLM evaluators recognize and favor their own generations. arXiv preprint, 2024. Chau Minh Pham, Jenna Russell, Dzung Pham, and Mohit Iyyer. Frankentext: Stitchi...

  2. [2]

    doi: 10.18653/v1/2025.acl-long.1577

    Association for Computational Linguistics. doi: 10.18653/v1/2025.acl-long.1577. URL https://aclanthology.org/2025.acl-long.1577/. Yufei Tian, Bodhisattwa Prasad Majumder, and Yulia Tsvetkov. Are large language models capable of generating human-level narratives? In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024...

  3. [3]

    Ghostbuster:

    Association for Computational Linguistics. doi: 10.18653/v1/2024.naacl-long.95. URL https://aclanthology.org/2024.naacl-long.95/. Lei Wang, Jianxun Lian, Yi Huang, Yanqi Dai, Haoxuan Li, Xu Chen, Xing Xie, and Ji-Rong Wen. CharacterBox: Evaluating the role-playing capabilities of LLMs in text-based virtual worlds. In Proceedings of the 2025 Annual Confere...

  4. [4]

    Predict 100 independent ending predictions $\hat{e}^{(i)}_1, \dots, \hat{e}^{(i)}_{100}$ conditioned on the prefix $(s_1, \dots, s_i)$, using a generation model $G$.

  5. [5]

    general direction

    Judge each prediction against the true remainder $(s_{i+1}, \dots, s_N)$ using a judge model $J$, which returns a binary verdict $j^{(i)}_k \in \{0, 1\}$ indicating whether $\hat{e}^{(i)}_k$ captures the “general direction” of the true continuation. 3. Aggregate into a position-level no-rate: $\text{no-rate}(i) = 1 - \frac{1}{100} \sum_{k=1}^{100} j^{(i)}_k$ (4). [Preprint. Under review.] A no-rate of 0.85 means...

  6. [6]

    This excludes positions following very short sentences (e.g., single-line dialogue tags) that provide minimal narrative context

    Minimum context: Position $i$ is skipped if the preceding sentence $s_{i-1}$ contains fewer than 10 words (MIN_SENT_WORDS = 10). This excludes positions following very short sentences (e.g., single-line dialogue tags) that provide minimal narrative context.

  7. [7]

    predictable because all readers would agree

    Token percentage window: We compute the fraction of story tokens revealed at each position using the tokenizer of the generation model. Only positions where 10%–99% of tokens have been revealed are retained. The lower bound excludes positions with insufficient context; the upper bound excludes trivial positions near the final sentence. Stories exceeding 5...

  8. [8]

    Mean no-rate. The primary metric. At each position $i$, the no-rate is the fraction of 100 predictions the judge rejects: $\text{no-rate}(i) = 1 - \frac{1}{100} \sum_{k=1}^{100} j^{(i)}_k$ (6). The story-level mean no-rate averages across all retained positions: $\text{no-rate} = \frac{1}{|P|} \sum_{i \in P} \text{no-rate}(i)$, with $P = \{ i : 10\% \le \text{pct}(i) \le 99\% \}$ (7).

  9. [9]

    This measures whether the story remains unpredictable as it approaches its ending, or whether the conclusion becomes obvious once most of the narrative has been read

    Late-stage no-rate. The mean no-rate restricted to positions where ≥80% of the story has been revealed. This measures whether the story remains unpredictable as it approaches its ending, or whether the conclusion becomes obvious once most of the narrative has been read.

  10. [10]

    The story-level metric is the mean of these drops across all peaks

    Post-spike convergence. For each local peak in the no-rate curve (a position where no-rate exceeds both neighbors), we compute the percentage drop from the peak to the minimum within the next 10 positions. The story-level metric is the mean of these drops across all peaks. More negative values indicate that tension peaks resolve more sharply: the story answ...

  11. [11]

    Before computing vertex angles, both axes are rescaled to [0, 1] (using each story’s own range) to fix a 1:1 aspect ratio

    Inflection rate. The fraction of positions at which the smoothed no-rate curve sharply reverses direction, as described in Section 3. Before computing vertex angles, both axes are rescaled to [0, 1] (using each story’s own range) to fix a 1:1 aspect ratio. We report at three thresholds: α ∈ {30, 60, 120}. Group-level reporting. For a group of n stories, we r...

  12. [12]

    Step 0: Warmup. Extract a structural beat sheet from a reference story (either a famous title or a full New Yorker story text).

  13. [13]

    When warmup is absent, this step generates beats from scratch

    Step 1: Beat Adaptation. Adapt the warmup beat sheet to the target story idea. When warmup is absent, this step generates beats from scratch.

  14. [14]

    Step 2: Story Writing. Write the final story from the adapted beats, or directly from the idea in vanilla mode. All API calls use the OpenRouter endpoint. Reasoning/thinking is disabled for all calls. Stories shorter than 200 characters are automatically discarded and regenerated. D.2 Shared Hyperparameters. Table 8 lists the hyperparameters shared across...
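
The story-level statistics excerpted in items 7 to 10 above can be sketched in a few lines of Python. This is an illustrative reading of those excerpts, not the authors' code. Assumptions: whitespace tokens stand in for the generation model's tokenizer, the late-stage window keeps the 99% upper bound, the percentage drop is signed so that falls are negative, and the minimum-context filter is not shown. All function names are hypothetical.

```python
from typing import List, Optional


def revealed_fraction(sentences: List[str]) -> List[float]:
    """pct(i) for each position, using whitespace tokens as a stand-in tokenizer."""
    counts = [len(s.split()) for s in sentences]
    total = sum(counts)
    running, out = 0, []
    for c in counts:
        running += c
        out.append(running / total)
    return out


def mean_no_rate(no_rates: List[float], pct: List[float],
                 lo: float = 0.10, hi: float = 0.99) -> Optional[float]:
    """Average no-rate over positions whose revealed fraction lies in [lo, hi]."""
    kept = [r for r, p in zip(no_rates, pct) if lo <= p <= hi]
    return sum(kept) / len(kept) if kept else None


def late_stage_no_rate(no_rates: List[float], pct: List[float]) -> Optional[float]:
    """Mean no-rate where at least 80% is revealed (99% cap kept as an assumption)."""
    return mean_no_rate(no_rates, pct, lo=0.80, hi=0.99)


def post_spike_convergence(curve: List[float], horizon: int = 10) -> Optional[float]:
    """Mean signed percentage change from each local peak to the minimum in the next `horizon` positions."""
    drops = []
    for k in range(1, len(curve) - 1):
        if curve[k] > curve[k - 1] and curve[k] > curve[k + 1]:
            trough = min(curve[k + 1:k + 1 + horizon])
            drops.append(100.0 * (trough - curve[k]) / curve[k])
    return sum(drops) / len(drops) if drops else None


# Invented placeholder curve: one no-rate and one revealed fraction per position.
curve = [0.2, 0.8, 0.4, 0.6, 0.3]
pct = [0.05, 0.30, 0.60, 0.85, 1.00]
overall = mean_no_rate(curve, pct)
late = late_stage_no_rate(curve, pct)
spike = post_spike_convergence(curve)
```

Each function returns `None` when no position qualifies, so a very short story is reported as missing rather than as zero tension.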