arxiv: 2605.17064 · v1 · pith:SNBJQ56Pnew · submitted 2026-05-16 · 💻 cs.AI

Towards Human-Level Book-Writing Capability

Pith reviewed 2026-05-20 15:37 UTC · model grok-4.3

classification 💻 cs.AI
keywords book generation · creative writing · long-context models · multi-resolution summaries · prompt-to-book · literary fiction · narrative planning · supervised fine-tuning

The pith

Training on prompt-to-book trajectories from novel summaries shifts AI writing toward human literary style.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper constructs training data by breaking public-domain novels into a hierarchy of summaries, from overall premise down to chapter and scene details. It then trains a long-context model to start from a high-level prompt, generate the intermediate plans at each level of detail, and finally produce the original human-written book text. This inverts the summarization process so that the model learns to expand outward while keeping authentic literary prose as the direct target. The goal is to move past the safe, explanatory style of instruction-tuned models and toward behaviors such as moral ambiguity and unreliable narration that appear in human fiction. A reader would care because current AI stories often feel structurally sound yet stylistically flat and disconnected from actual literary practice.

Core claim

The authors claim that reframing book-scale generation as the task of expanding a prompt through successively finer plans derived from real novels, and using the original human text as the final supervised target, makes large-scale creative writing learnable while preserving the stylistic and narrative qualities that standard assistant models are trained to suppress.

What carries the argument

The argument rests on the multi-resolution planning scaffold. Each novel is summarized at progressively finer levels, and the hierarchy is inverted during training so that the model generates the plans first and then expands them into the original human-authored prose.
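The coarse-to-fine layout described above can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the class names, the `compose_example` function, and the bracketed section markers such as `[PREMISE]` and `[BOOK]` are all hypothetical, and the paper may order or delimit the levels differently.

```python
from dataclasses import dataclass, field


@dataclass
class Scene:
    summary: str  # scene-level summary (finest plan level)
    text: str     # original human-authored prose for this scene


@dataclass
class Chapter:
    summary: str  # chapter-level summary
    scenes: list[Scene] = field(default_factory=list)


@dataclass
class Book:
    premise: str  # high-level premise (coarsest plan level)
    chapters: list[Chapter] = field(default_factory=list)


def compose_example(prompt: str, book: Book) -> dict[str, str]:
    """Lay out one training example coarse-to-fine: plans first, prose last.

    The section markers are hypothetical placeholders, not the paper's format.
    """
    plan = [f"[PREMISE]\n{book.premise}"]
    for i, ch in enumerate(book.chapters, start=1):
        plan.append(f"[CHAPTER {i} PLAN]\n{ch.summary}")
    for i, ch in enumerate(book.chapters, start=1):
        for j, sc in enumerate(ch.scenes, start=1):
            plan.append(f"[SCENE {i}.{j} PLAN]\n{sc.summary}")
    # The final supervised target is the untouched human prose.
    prose = [sc.text for ch in book.chapters for sc in ch.scenes]
    target = "\n\n".join(plan + ["[BOOK]"] + prose)
    return {"input": prompt, "target": target}
```

Keeping the prose as the last segment of the target mirrors the stated design choice that human text, not a summary, is what the model is ultimately supervised to produce.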

If this is right

  • Outputs would sustain coherent narrative arcs across many chapters instead of drifting into generic explanations.
  • Stories would naturally incorporate unreliable narrators and moral complexity drawn from the human targets.
  • The final prose would remain closer to the stylistic range of published fiction rather than defaulting to clear, helpful assistant language.
  • Book-length generation becomes tractable because the intermediate summary levels provide stepwise supervision.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same hierarchical expansion approach could apply to other long creative forms such as screenplays or game narratives.
  • Models trained this way might serve as collaborative drafting tools that produce initial versions closer to human literary drafts.
  • The method points toward using real creative artifacts as training signals for any task where current models default to overly safe or explanatory behavior.

Load-bearing premise

The summaries created from public-domain novels supply a planning structure rich enough to let the model learn full book generation without losing the original literary qualities of the target text.

What would settle it

Generate stories from the trained model and a baseline model on the same prompts, then check whether human readers or automated measures detect more frequent use of deception, moral ambiguity, or non-explanatory narration in the new outputs.
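One simple way to score such a comparison is a paired sign test over prompts. The sketch below assumes each prompt yields one score per model, for example a rater's count of passages showing deception, moral ambiguity, or non-explanatory narration. The scoring scheme, the function name `sign_test`, and the numbers in the usage lines are hypothetical and do not come from the paper.

```python
from math import comb


def sign_test(trained: list[float], baseline: list[float]) -> tuple[int, int, float]:
    """Exact two-sided sign test on paired per-prompt scores; ties are dropped."""
    if len(trained) != len(baseline):
        raise ValueError("scores must be paired per prompt")
    wins = sum(t > b for t, b in zip(trained, baseline))
    losses = sum(t < b for t, b in zip(trained, baseline))
    n = wins + losses
    if n == 0:
        return wins, losses, 1.0
    k = max(wins, losses)
    # P(X >= k) under Binomial(n, 0.5), doubled for a two-sided test.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return wins, losses, min(1.0, 2 * tail)


# Made-up scores for eight shared prompts, purely for illustration.
trained_scores = [3, 4, 5, 2, 6, 4, 5, 3]
baseline_scores = [1, 2, 2, 2, 3, 1, 4, 1]
print(sign_test(trained_scores, baseline_scores))
```

A sign test only uses the direction of each paired difference, which suits noisy human ratings. It says nothing about effect size, so a real study would likely report that separately.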

Figures

Figures reproduced from arXiv: 2605.17064 by Jan Zierstek, Matteo Batelic, Maya Medjad, Tim Schönenberger.

Figure 1: The figure illustrates the transformation of raw book text into a hierarchical planning ...
Figure 2: Token-level characterization of the corpus sequence. Top: upper-envelope token composi...
Figure 3: Hierarchical structure of a composed training example. The representation begins with the ...

Original abstract

Large language models optimized for instruction following and agentic tasks remain poorly aligned with the requirements of high-quality creative writing. Fiction frequently depends on behaviors that assistant-tuned models are explicitly trained to avoid, particularly deception, moral ambiguity, and unreliable narration. As a result, generated stories often appear structurally correct while remaining stylistically generic, overly explanatory, or weakly grounded in human literary behavior. We present a dataset construction and training framework for book-scale creative writing that reframes supervised fine-tuning as a prompt-to-book generation task grounded in human-authored fiction. Starting from public-domain novels, we derive a multi-resolution planning scaffold by summarizing each book at progressively finer levels, from a high-level premise to chapter- and scene-level structure. We then invert this hierarchy during training: the model learns to expand a prompt into increasingly detailed plans and finally into the original human-authored book text. This formulation preserves human prose as the final supervised target while using intermediate summaries to make book-scale generation learnable. We train a long-context language model on these prompt-to-book trajectories and study whether this objective shifts generation away from assistant-style prose and toward human literary writing.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript presents a dataset construction and training framework for book-scale creative writing with long-context language models. Starting from public-domain novels, it derives multi-resolution summaries (premise to chapter- and scene-level) to form a planning scaffold, then inverts the hierarchy so the model is trained to expand a prompt into increasingly detailed plans and finally the original human-authored text. The central objective is to shift model outputs away from assistant-style prose toward human literary behaviors such as moral ambiguity and unreliable narration.

Significance. If validated, the framework could meaningfully advance creative writing capabilities in LLMs by directly supervising on human-authored fiction rather than synthetic or instruction-tuned targets. The use of public-domain sources and preservation of original prose as the final target is a principled strength that avoids some common pitfalls in synthetic data pipelines. The multi-resolution inversion idea offers a concrete way to address the learnability challenges of book-length generation. However, because the manuscript reports no training runs, outputs, or evaluations, the significance remains entirely prospective.

major comments (2)
  1. [Abstract] The statement that the authors 'train a long-context language model on these prompt-to-book trajectories and study whether this objective shifts generation away from assistant-style prose' is not supported by any reported training procedure, loss curves, generated samples, baseline comparisons, or human judgments. Without these, the central claim cannot be assessed.
  2. [Dataset construction] The claim that multi-resolution summaries 'make book-scale generation learnable while preserving the stylistic and narrative qualities of the original human-authored text' is presented as an axiom. No ablation, artifact analysis, or qualitative comparison shows that the summarization-inversion process avoids degrading prose quality or introducing summary-like artifacts.
minor comments (1)
  1. The manuscript would benefit from explicit definitions or notation for the different summary resolutions (e.g., symbols distinguishing premise-level from scene-level summaries) to improve reproducibility.
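
As an illustration of what such notation might look like, the block below gives one possible convention. None of these symbols appear in the manuscript. The three levels follow the premise, chapter, and scene resolutions named in the abstract, and the loss is the standard next-token supervised objective, assumed here to be applied to everything after the prompt.

```latex
% Hypothetical notation; not taken from the manuscript.
% p: prompt, s^{(0)}: premise-level summary,
% s^{(1)}_c: summary of chapter c, s^{(2)}_{c,k}: summary of scene k in chapter c,
% x: original human-authored book text.
\[
  y = \bigl( s^{(0)},\; \{ s^{(1)}_c \}_{c=1}^{C},\; \{ s^{(2)}_{c,k} \}_{c,k},\; x \bigr),
  \qquad
  \mathcal{L}(\theta) = - \sum_{t=1}^{|y|} \log p_\theta\bigl( y_t \mid p,\, y_{<t} \bigr).
\]
```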

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. The manuscript currently presents a dataset construction pipeline and training objective for book-scale creative writing but does not include executed training runs or evaluations. We agree this limits the strength of certain claims and will revise the text to accurately reflect the scope of the contribution as a methodological framework while adding supporting analysis where feasible.

Point-by-point responses
  1. Referee: [Abstract] The statement that the authors 'train a long-context language model on these prompt-to-book trajectories and study whether this objective shifts generation away from assistant-style prose' is not supported by any reported training procedure, loss curves, generated samples, baseline comparisons, or human judgments. Without these, the central claim cannot be assessed.

    Authors: We accept this criticism. The abstract employs phrasing that implies completed training and analysis, yet the manuscript reports only the framework design and data construction process. No model was trained, and no outputs or metrics are provided. In revision we will change the abstract and introduction to describe a proposed training procedure and planned study rather than an executed one, making the prospective nature of the empirical claims explicit.

    Revision: yes

  2. Referee: [Dataset construction] The claim that multi-resolution summaries 'make book-scale generation learnable while preserving the stylistic and narrative qualities of the original human-authored text' is presented as an axiom. No ablation, artifact analysis, or qualitative comparison shows that the summarization-inversion process avoids degrading prose quality or introducing summary-like artifacts.

    Authors: The observation is accurate. The manuscript asserts the benefit of the hierarchical inversion without empirical checks on whether summarization introduces artifacts or alters narrative qualities. We will add a new subsection with side-by-side qualitative examples of premise-, chapter-, and scene-level summaries against the corresponding original passages, together with a brief discussion of observed limitations and potential summary-induced artifacts. Full ablations remain future work.

    Revision: yes

Circularity Check

0 steps flagged

No circularity: framework uses external human-authored text as target

Full rationale

The paper describes constructing multi-resolution summaries from public-domain novels and training on inverted prompt-to-book trajectories where the final supervised target is the original human-authored book text. No equations, fitted parameters, or self-citations are presented that reduce the claimed stylistic shift to a quantity defined by the training objective itself. The derivation remains self-contained against external benchmarks because the human prose serves as an independent reference rather than being regenerated from model outputs or prior self-referential results.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The framework rests on the assumption that public-domain novels are representative of high-quality literary fiction and that hierarchical summaries can be inverted without losing narrative coherence or stylistic nuance.

axioms (2)
  • [domain assumption] Public-domain novels provide a sufficient and unbiased source of high-quality human literary behavior for training targets.
    Stated in the abstract as the starting point for deriving the planning scaffold.
  • [ad hoc to paper] Multi-resolution summaries can be generated and inverted to make book-scale generation learnable without introducing artifacts that degrade prose quality.
    Central to the prompt-to-book trajectory construction described in the abstract.

pith-pipeline@v0.9.0 · 5728 in / 1352 out tokens · 35381 ms · 2026-05-20T15:37:12.004788+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean reality_from_one_distinction unclear

    Relation between the paper passage and the cited Recognition theorem.

    We present a dataset construction and training framework for book-scale creative writing that reframes supervised fine-tuning as a prompt-to-book generation task grounded in human-authored fiction. Starting from public-domain novels, we derive a multi-resolution planning scaffold by summarizing each book at progressively finer levels, from a high-level premise to chapter- and scene-level structure.

  • IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

    Relation between the paper passage and the cited Recognition theorem.

    The model is trained to generate the planning scaffold and the original book text from a synthetic prompt, following a coarse-to-fine expansion process.

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
