The Challenge and Reward of Fair Play in Narrative: A Computational Approach
Pith reviewed 2026-05-19 04:11 UTC · model grok-4.3
The pith
Surprise and coherence trade off for any single reader model, but they can coexist once pre-revelation and post-resolution reader modes are distinguished.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Our central theoretical result shows that surprise and coherence must trade off for any single reader model, but can coexist when two reader modes are distinguished: a pre-revelation mode that forms expectations while the ending is unknown, and a post-resolution hindsight mode that re-evaluates the story after the culprit is revealed. The balance of these two dimensions is realized in the common requirement of fair play, giving the reader a chance to solve the mystery while maintaining a challenge. We operationalize the framework using large language models as simulated readers, and define reference-less evaluation metrics for surprise, coherence, and fair play.
What carries the argument
The distinction between a pre-revelation mode that forms expectations while the ending is unknown and a post-resolution hindsight mode that re-evaluates the story after the culprit is revealed, which together enable fair play in an information-theoretic model of narrative.
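A toy numerical illustration of why the two modes matter, assuming (as the referee report phrases it) that surprise is scored as negative log probability and coherence as a retrospective probability of the revealed culprit. The function names and probability values below are hypothetical and are not taken from the paper.

```python
import math

def surprise_bits(p_culprit: float) -> float:
    """Surprisal of the revealed culprit in bits: -log2 p."""
    return -math.log2(p_culprit)

def coherence(p_culprit: float) -> float:
    """Toy coherence score: the probability the reader assigns to the culprit."""
    return p_culprit

# Single reader model: one probability drives both scores,
# so any increase in coherence is paid for with lower surprise.
single = [(surprise_bits(p), coherence(p)) for p in (0.1, 0.5, 0.9)]

# Two reader modes: separate probabilities before and after the reveal
# (hypothetical values), so both scores can be high at once.
p_pre, p_post = 0.1, 0.9
two_mode = (surprise_bits(p_pre), coherence(p_post))

if __name__ == "__main__":
    for s, c in single:
        print(f"single model: surprise={s:.2f} bits, coherence={c:.2f}")
    print(f"two modes:    surprise={two_mode[0]:.2f} bits, coherence={two_mode[1]:.2f}")
```

In the single-model rows, surprise falls monotonically as coherence rises. In the two-mode row, the pre-revelation probability sets surprise and the hindsight probability sets coherence, so the two are no longer tied together.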
If this is right
- While models generally succeed in creating surprise or coherence, achieving fair play poses a challenge even for strong models.
- Surprise and coherence do not positively correlate across stories, resisting reduction to a single latent quality.
- The metrics reproduce established literary intuitions, finding Christie's stories more surprising and more fair-playing than Conan Doyle's.
- A human study validates the metrics, confirming they capture aspects of narrative quality that matter to readers.
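The claim that surprise and coherence do not positively correlate across stories can be checked with an ordinary correlation coefficient. The sketch below uses Pearson's r as one reasonable choice; the source does not say which statistic the paper uses, and the per-story scores are made-up placeholders.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical per-story scores (one entry per story), for illustration only.
surprise_scores = [3.2, 0.4, 2.1, 1.0]
coherence_scores = [0.3, 0.9, 0.5, 0.8]
r = pearson(surprise_scores, coherence_scores)

if __name__ == "__main__":
    print(f"Pearson r between surprise and coherence: {r:.3f}")
```

A value of r at or below zero would be consistent with the bullet above. A clearly positive r would suggest that the two metrics track a single latent quality.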
Where Pith is reading between the lines
- This framework could guide the creation of story generators that explicitly target the balance between the two reader modes rather than optimizing surprise or coherence in isolation.
- The same pre- and post-revelation distinction might apply to evaluating other revelation-based narratives such as films or games.
- Direct side-by-side testing of LLM reader simulations against human expectation-formation data could identify where the model approximations diverge from actual reader experience.
Load-bearing premise
Large language models prompted as readers can faithfully simulate the formation of human expectations before the reveal and the retrospective coherence judgments after the reveal, without systematic biases from their training data.
What would settle it
A controlled comparison in which human readers rate the same stories for surprise and coherence, and which either shows no trade-off in single-mode judgments or fails to reproduce the pre-revelation versus post-resolution distinction produced by the LLM simulations.
Original abstract
Good storytelling involves surprise -- unpredictability in how the story unfolds -- and sense-making, the requirement that the story forms a coherent sequence. However, to date, these two qualities have largely been addressed in isolation. We formalize these qualities and their relationship in an information-theoretic framework, using detective fiction as a paradigm case of narratives in which a hidden truth is discovered through reasoning. Our central theoretical result shows that surprise and coherence must trade off for any *single* reader model, but can coexist when two reader modes are distinguished: a pre-revelation mode that forms expectations while the ending is unknown, and a post-resolution hindsight mode that re-evaluates the story after the culprit is revealed. The balance of these two dimensions is realized in the common requirement of *fair play*, giving the reader a chance to solve the mystery while maintaining a challenge. We operationalize the framework using large language models as simulated readers, and define reference-less evaluation metrics for surprise, coherence, and fair play. Experiments on LLM-generated stories validate our theoretical predictions: while models generally succeed in creating surprise or coherence, achieving fair play poses a challenge even for strong models. Moreover, surprise and coherence do not positively correlate across stories, resisting reduction to a single latent quality. A human study validates the metrics, confirming they capture aspects of narrative quality that matter to readers. Our metrics also reproduce established literary intuitions, finding Christie's stories more surprising and more fair-playing than Conan Doyle's.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper formalizes surprise (unpredictability of the ending under pre-reveal expectations) and coherence (retrospective sense-making after the reveal) in an information-theoretic framework for detective narratives. It proves that these qualities trade off for any single reader model but can be jointly achieved by distinguishing a pre-revelation mode (forming expectations while the culprit is unknown) from a post-resolution hindsight mode. This balance is identified with the literary notion of fair play. The framework is operationalized via reference-less metrics computed from LLM token probabilities under prefix-only and full-story prompts; experiments on LLM-generated stories, a human validation study, and comparisons of Christie vs. Conan Doyle texts are used to test predictions that fair play is difficult, that surprise and coherence do not positively correlate, and that the metrics recover established literary intuitions.
Significance. If the central result holds, the work supplies a principled, falsifiable bridge between information theory and narrative studies that could guide both automated story generation and quantitative literary analysis. Credit is due for the clean derivation of the single-model tradeoff, the explicit two-mode escape, the human study grounding the metrics, and the reproduction of author-specific intuitions without reference texts.
major comments (2)
- [Experiments and Metrics] The operationalization of the pre-revelation mode (prefix-only LLM prompts) is load-bearing for the empirical test of the information-theoretic tradeoff. Because LLMs are trained on complete detective stories, residual knowledge of the culprit can leak into prefix probabilities, so the measured surprise may already incorporate post-reveal information. This undermines the claim that the observed lack of positive correlation between surprise and coherence validates the two-mode distinction rather than prompt artifacts. A concrete control (e.g., stories withheld from training or explicit 'ignorance' fine-tuning) is needed.
- [Human Study] The human validation study is cited as confirming that the metrics capture reader-relevant aspects of narrative quality, yet the paper does not report inter-rater reliability, exact rating scales, or whether raters saw prefixes versus full stories. Without these details the study cannot fully adjudicate whether the LLM metrics isolate the pre- versus post-reveal modes as theorized.
minor comments (2)
- [Appendix] The exact prompting templates and temperature settings used for the LLM reader simulations should be provided in an appendix or supplementary material to allow reproduction.
- [Theoretical Framework] Notation for the information-theoretic quantities (e.g., the precise definitions of surprise as negative log probability and coherence as retrospective probability) could be collected in a single table for clarity.
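The report describes metrics computed from LLM token probabilities under prefix-only and full-story prompts. A minimal sketch of that pipeline follows, under stated assumptions: the token log-probabilities are hard-coded placeholders rather than output from a real model, and normalizing over a fixed suspect list is an illustrative choice that the source does not confirm. All names are hypothetical.

```python
import math

def sequence_logprob(token_logprobs):
    """Log-probability of a multi-token answer: the sum of its token log-probs."""
    return sum(token_logprobs)

def normalize_over_suspects(logprob_by_suspect):
    """Softmax over candidate suspects so the scores form a distribution."""
    top = max(logprob_by_suspect.values())
    weights = {name: math.exp(lp - top) for name, lp in logprob_by_suspect.items()}
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}

# Placeholder token log-probs (natural log) for each suspect's name,
# as they might be read off a model under the two prompt conditions.
prefix_only = {"butler": [-0.2, -0.1], "niece": [-2.0, -1.5]}
full_story = {"butler": [-3.0, -2.0], "niece": [-0.1, -0.05]}

pre = normalize_over_suspects({k: sequence_logprob(v) for k, v in prefix_only.items()})
post = normalize_over_suspects({k: sequence_logprob(v) for k, v in full_story.items()})

culprit = "niece"
surprise_nats = -math.log(pre[culprit])  # pre-revelation mode
hindsight_prob = post[culprit]           # post-resolution mode

if __name__ == "__main__":
    print(f"surprise: {surprise_nats:.2f} nats, hindsight probability: {hindsight_prob:.3f}")
```

Keeping the two prompt conditions in separate dictionaries mirrors the two reader modes. The leakage concern raised in the major comments would show up here as prefix-only probabilities that already favor the true culprit.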
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which identify important considerations for strengthening the empirical grounding of our framework. We respond to each major point below, indicating planned revisions where appropriate.
Point-by-point responses
- Referee: [Experiments and Metrics] The operationalization of the pre-revelation mode (prefix-only LLM prompts) is load-bearing for the empirical test of the information-theoretic tradeoff. Because LLMs are trained on complete detective stories, residual knowledge of the culprit can leak into prefix probabilities, so the measured surprise may already incorporate post-reveal information. This undermines the claim that the observed lack of positive correlation between surprise and coherence validates the two-mode distinction rather than prompt artifacts. A concrete control (e.g., stories withheld from training or explicit 'ignorance' fine-tuning) is needed.
  Authors: We agree that training-data leakage is a substantive concern for any LLM-based simulation of reader expectations, particularly for canonical texts. Our primary validation experiments, however, rely on stories generated by the models themselves during the study; these narratives are produced on the fly and therefore absent from pre-training corpora. For the Christie–Conan Doyle comparisons we acknowledge the limitation and will add an explicit limitations subsection discussing potential leakage effects and their bearing on the observed lack of positive correlation. A full withheld-story control or ignorance fine-tuning lies outside the present computational budget but could be pursued in follow-up work; we will instead report a supplementary analysis using low-frequency synthetic plots to probe robustness.
  Revision: partial
- Referee: [Human Study] The human validation study is cited as confirming that the metrics capture reader-relevant aspects of narrative quality, yet the paper does not report inter-rater reliability, exact rating scales, or whether raters saw prefixes versus full stories. Without these details the study cannot fully adjudicate whether the LLM metrics isolate the pre- versus post-reveal modes as theorized.
  Authors: We apologize for the omission of these details. The revised manuscript will expand the human-study section to report inter-rater reliability (Cronbach’s α), the precise 1–7 Likert scales employed for surprise and coherence, and the protocol under which raters evaluated prefixes when assessing surprise and full stories when assessing coherence, thereby aligning the human judgments with the pre- and post-revelation modes. These additions will allow readers to evaluate how directly the study supports the two-mode distinction.
  Revision: yes
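For readers unfamiliar with the reliability statistic named in the rebuttal, the sketch below computes Cronbach's α with raters treated as items and stories as observations, using the standard formula α = k/(k-1) · (1 - Σ var(rater) / var(story totals)). The rating matrix is invented for illustration; the study's actual data and procedure are not reported in the source.

```python
from statistics import variance

def cronbach_alpha(ratings_by_rater):
    """Cronbach's alpha; ratings_by_rater holds one list of story ratings per rater."""
    k = len(ratings_by_rater)
    if k < 2:
        raise ValueError("need at least two raters")
    n = len(ratings_by_rater[0])
    if n < 2 or any(len(r) != n for r in ratings_by_rater):
        raise ValueError("each rater must rate the same set of >= 2 stories")
    # Sum of per-rater sample variances across stories.
    rater_variance_sum = sum(variance(r) for r in ratings_by_rater)
    # Sample variance of the per-story totals across raters.
    story_totals = [sum(column) for column in zip(*ratings_by_rater)]
    total_variance = variance(story_totals)
    if total_variance == 0:
        raise ValueError("alpha is undefined when story totals do not vary")
    return (k / (k - 1)) * (1 - rater_variance_sum / total_variance)

# Hypothetical 1-7 Likert ratings: three raters, four stories.
example_ratings = [
    [2, 5, 6, 3],
    [3, 5, 7, 2],
    [2, 4, 6, 3],
]
alpha = cronbach_alpha(example_ratings)

if __name__ == "__main__":
    print(f"Cronbach's alpha: {alpha:.3f}")
```

Values near 1 indicate that raters order the stories similarly. Running the same computation separately on prefix-based surprise ratings and full-story coherence ratings would match the protocol the authors describe.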
Circularity Check
Information-theoretic derivation of surprise-coherence tradeoff is self-contained and independent of LLM simulations
full rationale
The paper's central theoretical result is presented as a direct consequence of formalizing surprise and coherence within an information-theoretic framework for reader models. This establishes a necessary tradeoff under a single mode and its relaxation under distinct pre-revelation and post-resolution modes without any reduction to fitted parameters, empirical data, or self-citations. The subsequent operationalization via LLM prompting and reference-less metrics is treated as a separate validation step, cross-checked against human judgments and classic texts, so the theoretical claim does not collapse into its experimental inputs by construction.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Reader expectations and retrospective coherence can be quantified via information-theoretic quantities such as surprise and mutual information.
- domain assumption: Large language models can be prompted to simulate distinct pre- and post-revelation reader states.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel
  Relation between the paper passage and the cited Recognition theorem: unclear
Our central theoretical result shows that surprise and coherence must trade off for any single reader model, but can coexist when two reader modes are distinguished: a pre-revelation mode... and a post-resolution hindsight mode... We formalize these qualities... using an information-theoretic framework... H(M∞(i); M(i)) ... C-EFFM(i+1) ... δin(i) := Σ C-EFFM1(j)
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction
  Relation between the paper passage and the cited Recognition theorem: unclear
Theorem (Coherence-surprise tradeoffs). Assume an externally coherent story model SM. 1. A reader M cannot be both strongly surprised and intelligent. 2. ... internal coherence and surprise trade off.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Dana Royce Baerger and Dan P McAdams
URL https://arxiv.org/abs/2504.11900. Dana Royce Baerger and Dan P McAdams. Life story coherence and its relation to psychological well-being. Narrative inquiry, 9(1):69–96,
-
[2]
Association for Computational Linguistics. ISBN 979-8-89176-247-3. URL https://aclanthology.org/2025.wnu-1.7/. Ernst Bloch, Roswitha Mueller, and Stephen Thaman. A philosophical view of the detective novel. Discourse, 2:32–52,
-
[3]
URL https://arxiv.org/abs/2505.07601. Maksym Del and Mark Fishel. True detective: A deep abductive reasoning benchmark undoable for GPT-3 and challenging for GPT-4. In Alexis Palmer and Jose Camacho-collados, editors, Proceedings of the 12th Joint Conference on Lexical and Computational Semantics (*SEM 2023), pages 314–322, Toronto, Canada, July
-
[4]
doi: 10.18653/v1/2023.starsem-1.28
Association for Computational Linguistics. doi: 10.18653/v1/2023.starsem-1.28. URL https://aclanthology.org/2023.starsem-1.28. Lorenz Demey. The dynamics of surprise. Logique et Analyse, 58(230):251–277,
-
[5]
doi: 10.1126/sciadv.adn5290. URL https://www.science.org/doi/abs/10.1126/sciadv.adn5290. Malcah Effron. American Golden Age Crime Fiction, page 167–178. Cambridge University Press,
-
[6]
URL https://aclanthology.org/Q18-1001/
doi: 10.1162/tacl_a_00001. URL https://aclanthology.org/Q18-1001/. Antonios Georgiou, Tankut Can, Mikhail Katkov, and Misha Tsodyks. Large-scale study of human memory for meaningful narratives. Learning & Memory, 32(2):a054043,
-
[7]
doi: 10.1111/j.1540-6245.2011.01470.x
ISSN 0021-8529. doi: 10.1111/j.1540-6245.2011.01470.x. URL https://doi.org/10.1111/j.1540-6245.2011.01470.x. John Hale. A probabilistic Earley parser as a psycholinguistic model. In Second Meeting of the North American Chapter of the Association for Computational Linguistics,
-
[8]
URL https://aclanthology.org/2024.scalellm-1.5/
Association for Computational Linguistics. URL https://aclanthology.org/2024.scalellm-1.5/. Sil Hamilton, Rebecca M. M. Hicke, Matthew Wilkens, and David Mimno. Too long, didn’t model: Decomposing llm long-context understanding with novels,
-
[9]
Hans Ole Hatzel, Haimo Stiemer, Chris Biemann, and Evelyn Gius
URL https://arxiv.org/abs/2505.14925. Hans Ole Hatzel, Haimo Stiemer, Chris Biemann, and Evelyn Gius. Machine learning in computational literary studies. it-Information Technology, 65(4-5):200–217,
-
[10]
Understanding the planning of LLM agents: A survey
URL https://arxiv.org/abs/2402.02716. Peter Hühn. The detective as reader: Narrativity and reading concepts in detective fiction. MFS Modern Fiction Studies, 33(3):451–466,
-
[11]
Stephen Knight. The golden age, page 77–94. Cambridge Companions to Literature. Cambridge University Press, 2003a. Stephen Thomas Knight. Crime fiction 1800-2000: Detection, death, diversity. Palgrave Macmillan, 2003b. R.A. Knox and H. Harrington. The Best English Detective Stories of
-
[12]
URL https://aclanthology.org/Q18-1023/
doi: 10.1162/tacl_a_00023. URL https://aclanthology.org/Q18-1023/. Michal Kosinski. Evaluating large language models in theory of mind tasks. Proceedings of the National Academy of Sciences, 121(45):e2405460121,
-
[13]
URL https://arxiv.org/abs/2407.02446. Sarah J Link. Defining detective fiction. In A Narratological Approach to Lists in Detective Fiction, pages 17–38. Springer, 2023a. Sarah J Link. Dossier novels: The reader as detective. In A Narratological Approach to Lists in Detective Fiction, pages 39–84. Springer, 2023b. Franco Moretti. The slaughterhouse of lit...
-
[14]
A corpus and cloze evaluation for deeper understanding of commonsense stories
Nasrin Mostafazadeh, Nathanael Chambers, Xiaodong He, Devi Parikh, Dhruv Batra, Lucy Vanderwende, Pushmeet Kohli, and James Allen. A corpus and cloze evaluation for deeper understanding of commonsense stories. In Kevin Knight, Ani Nenkova, and Owen Rambow, editors, Proceedings of the 2016 Conference of the North American Chapter of the Association f...
-
[15]
A corpus and cloze evaluation for deeper understanding of commonsense stories
Association for Computational Linguistics. doi: 10.18653/v1/N16-1098. URL https://aclanthology.org/N16-1098/. Kotaro Nishigori and Hideaki Takeda. Evaluating narrative coherence in collaborative storytelling with generative ai. In Proceedings of the 2025 Conference on Creativity and Cognition, C&C ’25, page 443–447, New York, NY, USA,
-
[16]
Association for Computing Machinery. ISBN 9798400712890. doi: 10.1145/3698061.3734393. URL https://doi.org/10.1145/3698061.3734393. Andrew Ortony and Derek Partridge. Surprisingness and expectation failure: what’s the difference? In IJCAI, volume 87, pages 106–108,
-
[17]
PlotMachines: Outline-conditioned generation with dynamic plot state tracking
Hannah Rashkin, Asli Celikyilmaz, Yejin Choi, and Jianfeng Gao. PlotMachines: Outline-conditioned generation with dynamic plot state tracking. In Bonnie Webber, Trevor Cohn, Yulan He, and Yang Liu, editors, Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4274–4295, Online, November
-
[18]
doi: 10.18653/v1/2020.emnlp-main.349
Association for Computational Linguistics. doi: 10.18653/v1/2020.emnlp-main.349. URL https://aclanthology.org/2020.emnlp-main.349/. Elaine Reese, Catherine A Haden, Lynne Baker-Ward, Patricia Bauer, Robyn Fivush, and Peter A Ornstein. Coherence of personal narratives across the lifespan: A multidimensional model and coding method. Journal of cognition an...
-
[19]
URL http://www.jstor.org/stable/23363477
ISSN 1040726X, 1573336X. URL http://www.jstor.org/stable/23363477. Lion Schulz, Miguel Patrício, and Daan Odijk. Narrative information theory,
-
[20]
URL https://arxiv.org/abs/2411.12907. Claude Elwood Shannon. A mathematical theory of communication. The Bell system technical journal, 27(3):379–423,
-
[21]
URL http://www.jstor.org/stable/1499897
ISSN 0043373X. URL http://www.jstor.org/stable/1499897. Nathaniel J Smith and Roger Levy. Optimal processing times in reading: A formal model and empirical investigation. In Proceedings of the Annual Meeting of the Cognitive Science Society, volume 30,
-
[22]
URL https://doi.org/10.3366/cfs.2023.0096
doi: 10.3366/cfs.2023.0096. URL https://doi.org/10.3366/cfs.2023.0096. Prashanth Vijayaraghavan and Deb Roy. M-sense: Modeling narrative structure in short personal narratives using protagonist’s mental representations,
-
[23]
URL https://arxiv.org/abs/2302.09418. Eitan Wagner and Omri Abend. What do language model probabilities represent? from distribution estimation to response prediction. arXiv preprint arXiv:2505.02072,
-
[24]
David Wilmot and Frank Keller
URL https://arxiv.org/abs/2506.10161. David Wilmot and Frank Keller. Modelling suspense in short stories as uncertainty reduction over neural representation. In Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel Tetreault, editors, Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1763–1788, Online, July
-
[25]
Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.161. URL https://aclanthology.org/2020.acl-main.161. Kaige Xie and Mark Riedl. Creating suspenseful stories: Iterative planning with large language models. In Yvette Graham and Matthew Purver, editors, Proceedings of the 18th Conference of the European Chapter of the Association fo...
-
[26]
URL https://aclanthology.org/2024.eacl-long.147/
Association for Computational Linguistics. URL https://aclanthology.org/2024.eacl-long.147/. Anbang Ye, Christopher Cui, Taiwei Shi, and Mark O. Riedl. Neural story planning,
-
[27]
URL https://arxiv.org/abs/2212.08718. A Proofs A.1 Intelligence and clue probabilities Here, we show that reader intelligence can be interpreted as a result of similar clue probability judgments among detectives. Lemma
-
[28]
This definition was generally adopted in surprisal theory [Smith and Levy, 2008]
defines the surprise as the log-probability assigned to the observed word. This definition was generally adopted in surprisal theory [Smith and Levy, 2008]. In their famous work, Ely et al
-
[29]
In our work, we assess the surprisal regarding the identity of the culprit
define a similar metric (“pivot”) for a general property of the story. In our work, we assess the surprisal regarding the identity of the culprit. A challenge with this formulation is that it captures changes regardless of their direction. In the case of a story with a distractor, it will, on one hand, capture belief changes from the distractor to the tru...
-
[30]
apply the cognitive psychology Narrative Coherence Coding Scheme [NaCCS; Reese et al., 2011] to evaluate coherence of personal narratives generated by LLMs. Nishigori and Takeda