
arxiv: 2507.13841 · v2 · submitted 2025-07-18 · 💻 cs.CL

The Challenge and Reward of Fair Play in Narrative: A Computational Approach

Pith reviewed 2026-05-19 04:11 UTC · model grok-4.3

classification 💻 cs.CL
keywords narrative quality, surprise, coherence, fair play, detective fiction, computational evaluation, large language models, information theory

The pith

Surprise and coherence trade off for any single reader model but coexist when distinguishing pre-revelation and post-resolution modes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to show that surprise and coherence in storytelling trade off against each other when viewed from one perspective but can both be present when readers switch between anticipating an unknown ending and reassessing the plot once the truth is known. This matters because it formalizes the idea of fair play in detective stories, where the narrative gives readers a genuine chance to figure out the mystery while ensuring the solution feels logical in retrospect. The work uses large language models to act as simulated readers and develops metrics to measure these qualities without needing reference stories. Experiments confirm that models can produce either surprising or coherent stories but find it hard to balance both through fair play, and that surprise and coherence do not simply go together. A human study further supports that these metrics reflect what actual readers value in narratives.

Core claim

Our central theoretical result shows that surprise and coherence must trade off for any single reader model, but can coexist when two reader modes are distinguished: a pre-revelation mode that forms expectations while the ending is unknown, and a post-resolution hindsight mode that re-evaluates the story after the culprit is revealed. The balance of these two dimensions is realized in the common requirement of fair play, giving the reader a chance to solve the mystery while maintaining a challenge. We operationalize the framework using large language models as simulated readers, and define reference-less evaluation metrics for surprise, coherence, and fair play.
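The trade-off and its two-mode escape can be illustrated with a toy calculation. The sketch below is not the paper's formalism: it assumes surprise is the negative log-probability a reader assigns to the true culprit, and coherence is the probability assigned to the culprit in hindsight. The suspect names, the probabilities, and the helper `surprisal_bits` are hypothetical.

```python
import math

def surprisal_bits(beliefs, culprit):
    """Surprisal (in bits) of the true culprit under a belief distribution."""
    return -math.log2(beliefs[culprit])

# Hypothetical beliefs over three suspects (illustrative numbers, not paper data).
pre_reveal = {"butler": 0.70, "heiress": 0.25, "gardener": 0.05}  # ending unknown
hindsight = {"butler": 0.05, "heiress": 0.05, "gardener": 0.90}   # after the reveal
culprit = "gardener"

# Single reader: both qualities are read off the same probability, so
# coherence = 2 ** (-surprise) and one can only rise as the other falls.
single_surprise = surprisal_bits(pre_reveal, culprit)
single_coherence = pre_reveal[culprit]

# Two modes: surprise comes from the pre-revelation reader,
# coherence from the post-resolution hindsight reader.
two_mode_surprise = surprisal_bits(pre_reveal, culprit)
two_mode_coherence = hindsight[culprit]

print(f"single reader: surprise={single_surprise:.2f} bits, coherence={single_coherence:.2f}")
print(f"two modes:     surprise={two_mode_surprise:.2f} bits, coherence={two_mode_coherence:.2f}")
```

Under these assumptions a single reader who is strongly surprised necessarily assigns the solution a low probability, while separating the two modes lets a story score high on both.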

What carries the argument

The distinction between a pre-revelation mode that forms expectations while the ending is unknown and a post-resolution hindsight mode that re-evaluates the story after the culprit is revealed, which together enable fair play in an information-theoretic model of narrative.

If this is right

  • While models generally succeed in creating surprise or coherence, achieving fair play poses a challenge even for strong models.
  • Surprise and coherence do not positively correlate across stories, resisting reduction to a single latent quality.
  • The metrics reproduce established literary intuitions, finding Christie's stories more surprising and more fair-playing than Conan Doyle's.
  • A human study validates the metrics, confirming they capture aspects of narrative quality that matter to readers.
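The claim that surprise and coherence do not positively correlate across stories is the kind of statement a rank correlation can check. The following is a minimal sketch of a Spearman correlation in plain Python; the paper does not state which correlation statistic it uses, so the choice of Spearman and the function names `average_ranks` and `spearman` are assumptions.

```python
from statistics import mean

def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # mean of the 1-based ranks pos+1..end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks

def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the average ranks.

    Undefined (division by zero) if either input is constant.
    """
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5
```

Given per-story surprise and coherence scores as two equal-length lists, `spearman(surprise_scores, coherence_scores)` returns a value in [-1, 1]; a value at or below zero would be consistent with the bullet above.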

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This framework could guide the creation of story generators that explicitly target the balance between the two reader modes rather than optimizing surprise or coherence in isolation.
  • The same pre- and post-revelation distinction might apply to evaluating other revelation-based narratives such as films or games.
  • Direct side-by-side testing of LLM reader simulations against human expectation-formation data could identify where the model approximations diverge from actual reader experience.

Load-bearing premise

Large language models prompted as readers can faithfully simulate the formation of human expectations before the reveal and the retrospective coherence judgments after the reveal, without systematic biases from their training data.

What would settle it

A controlled comparison where human readers rate the same stories for surprise and coherence shows no trade-off in single-mode judgments or fails to match the pre-revelation versus post-resolution distinction produced by the LLM simulations.

Figures

Figures reproduced from arXiv: 2507.13841 by Eitan Wagner, Omri Abend, Renana Keydar.

Figure 1: Schematic map of the reader models, by inference ability and knowledge of the world and storytelling. Another internal reader is the (internal) brilliant-detective reader, $M_1$, which simulates the brilliant detective: $M_1(X_{1 \ldots i}, W) = D_1(C_{1 \ldots i})$ (5). For the brilliant-detective reader, Mt (i) is expected to generally increase with i, from a uniform distribution over Y, towards 1 before the end of the suspi…
Figure 2: Examples of idealized reader model curves. In the Introduction phase, all readers should …
Figure 3: Surprisal plot for Sherlock Holmes and Hercule Poirot stories. The curves present the …
Figure 4: Whisker plot for the different models. The bars represent the interquartile range (IQR), …
Original abstract

Good storytelling involves surprise -- unpredictability in how the story unfolds -- and sense-making, the requirement that the story forms a coherent sequence. However, to date, these two qualities have largely been addressed in isolation. We formalize these qualities and their relationship in an information-theoretic framework, using detective fiction as a paradigm case of narratives in which a hidden truth is discovered through reasoning. Our central theoretical result shows that surprise and coherence must trade off for any *single* reader model, but can coexist when two reader modes are distinguished: a pre-revelation mode that forms expectations while the ending is unknown, and a post-resolution hindsight mode that re-evaluates the story after the culprit is revealed. The balance of these two dimensions is realized in the common requirement of *fair play*, giving the reader a chance to solve the mystery while maintaining a challenge. We operationalize the framework using large language models as simulated readers, and define reference-less evaluation metrics for surprise, coherence, and fair play. Experiments on LLM-generated stories validate our theoretical predictions: while models generally succeed in creating surprise or coherence, achieving fair play poses a challenge even for strong models. Moreover, surprise and coherence do not positively correlate across stories, resisting reduction to a single latent quality. A human study validates the metrics, confirming they capture aspects of narrative quality that matter to readers. Our metrics also reproduce established literary intuitions, finding Christie's stories more surprising and more fair-playing than Conan Doyle's.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper formalizes surprise (unpredictability of the ending under pre-reveal expectations) and coherence (retrospective sense-making after the reveal) in an information-theoretic framework for detective narratives. It proves that these qualities trade off for any single reader model but can be jointly achieved by distinguishing a pre-revelation mode (forming expectations while the culprit is unknown) from a post-resolution hindsight mode. This balance is identified with the literary notion of fair play. The framework is operationalized via reference-less metrics computed from LLM token probabilities under prefix-only and full-story prompts; experiments on LLM-generated stories, a human validation study, and comparisons of Christie vs. Conan Doyle texts are used to test predictions that fair play is difficult, that surprise and coherence do not positively correlate, and that the metrics recover established literary intuitions.
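One way to picture the prefix-only operationalization described in this summary is as a belief curve over suspects that is recomputed after each story prefix. The sketch below assumes a hypothetical scoring step has already produced one unnormalized log-score per suspect for each prefix; it does not call any real LLM API, and the function names and numbers are illustrative rather than taken from the paper.

```python
import math

def normalize_log_scores(log_scores):
    """Turn unnormalized per-suspect log-scores into a probability distribution."""
    peak = max(log_scores.values())  # subtract the max for numerical stability
    weights = {name: math.exp(s - peak) for name, s in log_scores.items()}
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}

def surprisal_curve(log_scores_per_prefix, culprit):
    """Surprisal (bits) of the culprit after each successive story prefix."""
    curve = []
    for log_scores in log_scores_per_prefix:
        beliefs = normalize_log_scores(log_scores)
        curve.append(-math.log2(beliefs[culprit]))
    return curve

# Hypothetical log-scores for three prefixes of one story (not paper data).
prefix_scores = [
    {"butler": -1.0, "heiress": -1.0, "gardener": -1.0},  # introduction: uniform
    {"butler": -0.5, "heiress": -2.0, "gardener": -3.0},  # a distractor is favored
    {"butler": -4.0, "heiress": -4.0, "gardener": -0.1},  # clues point to the culprit
]
curve = surprisal_curve(prefix_scores, "gardener")
print([round(bits, 2) for bits in curve])
```

Scoring the same suspects under a full-story prompt and reading off the final distribution would give the hindsight counterpart, again under the assumptions stated above.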

Significance. If the central result holds, the work supplies a principled, falsifiable bridge between information theory and narrative studies that could guide both automated story generation and quantitative literary analysis. Credit is due for the clean derivation of the single-model tradeoff, the explicit two-mode escape, the human study grounding the metrics, and the reproduction of author-specific intuitions without reference texts.

major comments (2)
  1. [Experiments and Metrics] The operationalization of the pre-revelation mode (prefix-only LLM prompts) is load-bearing for the empirical test of the information-theoretic tradeoff. Because LLMs are trained on complete detective stories, residual knowledge of the culprit can leak into prefix probabilities, so the measured surprise may already incorporate post-reveal information. This undermines the claim that the observed lack of positive correlation between surprise and coherence validates the two-mode distinction rather than prompt artifacts. A concrete control (e.g., stories withheld from training or explicit 'ignorance' fine-tuning) is needed.
  2. [Human Study] The human validation study is cited as confirming that the metrics capture reader-relevant aspects of narrative quality, yet the paper does not report inter-rater reliability, exact rating scales, or whether raters saw prefixes versus full stories. Without these details the study cannot fully adjudicate whether the LLM metrics isolate the pre- versus post-reveal modes as theorized.
minor comments (2)
  1. [Appendix] The exact prompting templates and temperature settings used for the LLM reader simulations should be provided in an appendix or supplementary material to allow reproduction.
  2. [Theoretical Framework] Notation for the information-theoretic quantities (e.g., the precise definitions of surprise as negative log probability and coherence as retrospective probability) could be collected in a single table for clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which identify important considerations for strengthening the empirical grounding of our framework. We respond to each major point below, indicating planned revisions where appropriate.

Point-by-point responses
  1. Referee: [Experiments and Metrics] The operationalization of the pre-revelation mode (prefix-only LLM prompts) is load-bearing for the empirical test of the information-theoretic tradeoff. Because LLMs are trained on complete detective stories, residual knowledge of the culprit can leak into prefix probabilities, so the measured surprise may already incorporate post-reveal information. This undermines the claim that the observed lack of positive correlation between surprise and coherence validates the two-mode distinction rather than prompt artifacts. A concrete control (e.g., stories withheld from training or explicit 'ignorance' fine-tuning) is needed.

    Authors: We agree that training-data leakage is a substantive concern for any LLM-based simulation of reader expectations, particularly for canonical texts. Our primary validation experiments, however, rely on stories generated by the models themselves during the study; these narratives are produced on the fly and therefore absent from pre-training corpora. For the Christie–Conan Doyle comparisons we acknowledge the limitation and will add an explicit limitations subsection discussing potential leakage effects and their bearing on the observed lack of positive correlation. A full withheld-story control or ignorance fine-tuning lies outside the present computational budget but could be pursued in follow-up work; we will instead report a supplementary analysis using low-frequency synthetic plots to probe robustness. revision: partial

  2. Referee: [Human Study] The human validation study is cited as confirming that the metrics capture reader-relevant aspects of narrative quality, yet the paper does not report inter-rater reliability, exact rating scales, or whether raters saw prefixes versus full stories. Without these details the study cannot fully adjudicate whether the LLM metrics isolate the pre- versus post-reveal modes as theorized.

    Authors: We apologize for the omission of these details. The revised manuscript will expand the human-study section to report inter-rater reliability (Cronbach’s α), the precise 1–7 Likert scales employed for surprise and coherence, and the protocol that raters evaluated prefixes when assessing surprise and full stories when assessing coherence, thereby aligning the human judgments with the pre- and post-revelation modes. These additions will allow readers to evaluate how directly the study supports the two-mode distinction. revision: yes
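For readers unfamiliar with the reliability statistic named in this response, the following is a minimal sketch of Cronbach's α. It treats each rater as an "item" and each story as a case, which is an assumption about how the authors would set up the computation; the function name `cronbach_alpha` and the layout of `ratings` are hypothetical.

```python
from statistics import variance

def cronbach_alpha(ratings):
    """Cronbach's alpha for ratings[r][s] = score rater r gave to story s.

    Raters play the role of 'items'; sample variances are taken across stories.
    """
    k = len(ratings)
    if k < 2:
        raise ValueError("need at least two raters")
    rater_variances = [variance(scores) for scores in ratings]
    # Per-story totals: sum over raters of the scores for the same story.
    totals = [sum(column) for column in zip(*ratings)]
    return (k / (k - 1)) * (1 - sum(rater_variances) / variance(totals))
```

Called on a list of per-rater score lists (for example, 1–7 Likert ratings of the same stories), the function returns 1.0 when all raters agree exactly and lower values as their ratings diverge.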

Circularity Check

0 steps flagged

Information-theoretic derivation of surprise-coherence tradeoff is self-contained and independent of LLM simulations

Full rationale

The paper's central theoretical result is presented as a direct consequence of formalizing surprise and coherence within an information-theoretic framework for reader models. This establishes a necessary tradeoff under a single mode and its relaxation under distinct pre-revelation and post-resolution modes without any reduction to fitted parameters, empirical data, or self-citations. The subsequent operationalization via LLM prompting and reference-less metrics is treated as a separate validation step, cross-checked against human judgments and classic texts, so the theoretical claim does not collapse into its experimental inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The framework rests on information-theoretic modeling of reader expectations and the assumption that LLMs can proxy human narrative processing; no free parameters or invented entities are explicitly introduced in the abstract.

axioms (2)
  • domain assumption Reader expectations and retrospective coherence can be quantified via information-theoretic quantities such as surprise and mutual information.
    Invoked to formalize surprise and coherence and to derive the tradeoff result.
  • domain assumption Large language models can be prompted to simulate distinct pre- and post-revelation reader states.
    Required for the operationalization of the metrics and the experimental validation.

pith-pipeline@v0.9.0 · 5794 in / 1530 out tokens · 61292 ms · 2026-05-19T04:11:24.730967+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

    Relation between the paper passage and the cited Recognition theorem.

    Our central theoretical result shows that surprise and coherence must trade off for any single reader model, but can coexist when two reader modes are distinguished: a pre-revelation mode... and a post-resolution hindsight mode... We formalize these qualities... using an information-theoretic framework... H(M∞(i); M(i)) ... C-EFFM(i+1) ... δin(i) := Σ C-EFFM1(j)

  • IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

    Relation between the paper passage and the cited Recognition theorem.

    Theorem (Coherence-surprise tradeoffs). Assume an externally coherent story model SM. 1. A reader M cannot be both strongly surprised and intelligent. 2. ... internal coherence and surprise trade off.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
