Towards Human-Level Book-Writing Capability
Pith reviewed 2026-05-20 15:37 UTC · model grok-4.3
The pith
Training on prompt-to-book trajectories from novel summaries shifts AI writing toward human literary style.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors claim that reframing book-scale generation as the task of expanding a prompt through successively finer plans derived from real novels, and using the original human text as the final supervised target, makes large-scale creative writing learnable while preserving the stylistic and narrative qualities that standard assistant models are trained to suppress.
What carries the argument
The multi-resolution planning scaffold obtained by summarizing each novel at progressively finer levels, which is inverted during training so the model generates plans before expanding them into full human-authored prose.
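The inversion described above can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: the class names (`Scene`, `Chapter`, `Book`), the stage labels, and the choice to emit all plans before any prose are assumptions made here for clarity.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Scene:
    summary: str  # scene-level summary (finest planning level)
    text: str     # original human-authored prose for this scene


@dataclass
class Chapter:
    summary: str  # chapter-level summary
    scenes: List[Scene] = field(default_factory=list)


@dataclass
class Book:
    premise: str  # coarsest summary of the whole novel
    chapters: List[Chapter] = field(default_factory=list)


def build_trajectory(prompt: str, book: Book) -> List[Tuple[str, str]]:
    """Order one book's material coarse to fine: prompt, plans, then prose.

    Summaries are produced bottom-up from the novel; here they are replayed
    top-down so that the human text is the last thing the model must emit.
    """
    steps: List[Tuple[str, str]] = [("prompt", prompt), ("premise", book.premise)]
    steps += [("chapter_plan", ch.summary) for ch in book.chapters]
    steps += [("scene_plan", sc.summary) for ch in book.chapters for sc in ch.scenes]
    steps += [("prose", sc.text) for ch in book.chapters for sc in ch.scenes]
    return steps


def supervised_targets(steps: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Everything after the prompt is a training target (assumed here)."""
    return [step for step in steps if step[0] != "prompt"]
```

Under these assumptions, a one-chapter book with two scenes yields seven steps, and the final step is always original prose rather than a summary.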
If this is right
- Outputs would sustain coherent narrative arcs across many chapters instead of drifting into generic explanations.
- Stories would naturally incorporate unreliable narrators and moral complexity drawn from the human targets.
- The final prose would remain closer to the stylistic range of published fiction rather than defaulting to clear, helpful assistant language.
- Book-length generation becomes tractable because the intermediate summary levels provide stepwise supervision.
Where Pith is reading between the lines
- The same hierarchical expansion approach could apply to other long creative forms such as screenplays or game narratives.
- Models trained this way might serve as collaborative drafting tools that produce initial versions closer to human literary drafts.
- The method points toward using real creative artifacts as training signals for any task where current models default to overly safe or explanatory behavior.
Load-bearing premise
The summaries created from public-domain novels supply a planning structure rich enough to let the model learn full book generation without losing the original literary qualities of the target text.
What would settle it
Generate stories from the trained model and a baseline model on the same prompts, then check whether human readers or automated measures detect more frequent use of deception, moral ambiguity, or non-explanatory narration in the new outputs.
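One way to score such a paired comparison is an exact sign test over per-prompt reader preferences. The sketch below is an assumption about how the check could be run, not a procedure from the paper; the labels `"trained"`, `"baseline"`, and `"tie"` are hypothetical.

```python
from collections import Counter
from math import comb
from typing import Iterable, Tuple


def tally(judgments: Iterable[str]) -> Tuple[int, int]:
    """Count prompts where readers preferred each model; ties are dropped."""
    counts = Counter(judgments)
    return counts["trained"], counts["baseline"]


def sign_test_p_value(wins: int, losses: int) -> float:
    """Two-sided exact sign test under H0: each model is preferred half the time."""
    n = wins + losses
    if n == 0:
        return 1.0
    k = min(wins, losses)
    # Probability of a split at least this lopsided in one direction.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

For example, if readers prefer the trained model on 8 of 10 untied prompts, `sign_test_p_value(8, 2)` is about 0.109, which would not be strong evidence on its own; a real study would need far more prompts and a pre-registered definition of what counts as deception or moral ambiguity.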
Original abstract
Large language models optimized for instruction following and agentic tasks remain poorly aligned with the requirements of high-quality creative writing. Fiction frequently depends on behaviors that assistant-tuned models are explicitly trained to avoid, particularly deception, moral ambiguity, and unreliable narration. As a result, generated stories often appear structurally correct while remaining stylistically generic, overly explanatory, or weakly grounded in human literary behavior. We present a dataset construction and training framework for book-scale creative writing that reframes supervised fine-tuning as a prompt-to-book generation task grounded in human-authored fiction. Starting from public-domain novels, we derive a multi-resolution planning scaffold by summarizing each book at progressively finer levels, from a high-level premise to chapter- and scene-level structure. We then invert this hierarchy during training: the model learns to expand a prompt into increasingly detailed plans and finally into the original human-authored book text. This formulation preserves human prose as the final supervised target while using intermediate summaries to make book-scale generation learnable. We train a long-context language model on these prompt-to-book trajectories and study whether this objective shifts generation away from assistant-style prose and toward human literary writing.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents a dataset construction and training framework for book-scale creative writing with long-context language models. Starting from public-domain novels, it derives multi-resolution summaries (premise to chapter- and scene-level) to form a planning scaffold, then inverts the hierarchy so the model is trained to expand a prompt into increasingly detailed plans and finally the original human-authored text. The central objective is to shift model outputs away from assistant-style prose toward human literary behaviors such as moral ambiguity and unreliable narration.
Significance. If validated, the framework could meaningfully advance creative writing capabilities in LLMs by directly supervising on human-authored fiction rather than synthetic or instruction-tuned targets. The use of public-domain sources and preservation of original prose as the final target is a principled strength that avoids some common pitfalls in synthetic data pipelines. The multi-resolution inversion idea offers a concrete way to address the learnability challenges of book-length generation. However, because the manuscript reports no training runs, outputs, or evaluations, the significance remains entirely prospective.
major comments (2)
- [Abstract] Abstract: the statement that the authors 'train a long-context language model on these prompt-to-book trajectories and study whether this objective shifts generation away from assistant-style prose' is not supported by any reported training procedure, loss curves, generated samples, baseline comparisons, or human judgments; without these the central claim cannot be assessed.
- [Dataset construction] Dataset construction section: the claim that multi-resolution summaries 'make book-scale generation learnable while preserving the stylistic and narrative qualities of the original human-authored text' is presented as an axiom without any ablation, artifact analysis, or qualitative comparison showing that the summarization-inversion process does not degrade prose quality or introduce summary-like artifacts.
minor comments (1)
- The manuscript would benefit from explicit definitions or notation for the different summary resolutions (e.g., symbols distinguishing premise-level from scene-level summaries) to improve reproducibility.
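As an illustration of the kind of notation this comment asks for, one hypothetical scheme is given below. None of these symbols appear in the manuscript; the three summary levels follow the abstract, while the ordering of the target sequence and the standard next-token loss are assumptions.

```latex
% Hypothetical notation, not taken from the manuscript.
\begin{align*}
q &: \text{synthetic prompt} \\
s^{(0)} &: \text{premise-level summary of book } B \\
s^{(1)}_{i} &: \text{summary of chapter } i,\quad i = 1,\dots,n \\
s^{(2)}_{i,j} &: \text{summary of scene } j \text{ in chapter } i,\quad j = 1,\dots,m_i \\
x_{i,j} &: \text{original prose of scene } j \text{ in chapter } i \\
y &= \bigl(s^{(0)},\ \{s^{(1)}_{i}\},\ \{s^{(2)}_{i,j}\},\ \{x_{i,j}\}\bigr) \\
\mathcal{L}(\theta) &= -\sum_{t=1}^{|y|} \log p_\theta\bigl(y_t \mid y_{<t},\, q\bigr)
\end{align*}
```

With symbols of this kind, a reader could tell at a glance which resolution a given summary belongs to and which tokens are supervised.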
Simulated Author's Rebuttal
We thank the referee for the constructive comments. The manuscript currently presents a dataset construction pipeline and training objective for book-scale creative writing but does not include executed training runs or evaluations. We agree this limits the strength of certain claims and will revise the text to accurately reflect the scope of the contribution as a methodological framework while adding supporting analysis where feasible.
- Referee: [Abstract] Abstract: the statement that the authors 'train a long-context language model on these prompt-to-book trajectories and study whether this objective shifts generation away from assistant-style prose' is not supported by any reported training procedure, loss curves, generated samples, baseline comparisons, or human judgments; without these the central claim cannot be assessed.
  Authors: We accept this criticism. The abstract employs phrasing that implies completed training and analysis, yet the manuscript reports only the framework design and data construction process. No model was trained, and no outputs or metrics are provided. In revision we will change the abstract and introduction to describe a proposed training procedure and planned study rather than an executed one, making the prospective nature of the empirical claims explicit.
  Revision: yes
- Referee: [Dataset construction] Dataset construction section: the claim that multi-resolution summaries 'make book-scale generation learnable while preserving the stylistic and narrative qualities of the original human-authored text' is presented as an axiom without any ablation, artifact analysis, or qualitative comparison showing that the summarization-inversion process does not degrade prose quality or introduce summary-like artifacts.
  Authors: The observation is accurate. The manuscript asserts the benefit of the hierarchical inversion without empirical checks on whether summarization introduces artifacts or alters narrative qualities. We will add a new subsection with side-by-side qualitative examples of premise-, chapter-, and scene-level summaries against the corresponding original passages, together with a brief discussion of observed limitations and potential summary-induced artifacts. Full ablations remain future work.
  Revision: yes
Circularity Check
No circularity: framework uses external human-authored text as target
The paper describes constructing multi-resolution summaries from public-domain novels and training on inverted prompt-to-book trajectories where the final supervised target is the original human-authored book text. No equations, fitted parameters, or self-citations are presented that reduce the claimed stylistic shift to a quantity defined by the training objective itself. The derivation remains self-contained against external benchmarks because the human prose serves as an independent reference rather than being regenerated from model outputs or prior self-referential results.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Public-domain novels provide a sufficient and unbiased source of high-quality human literary behavior for training targets.
- ad hoc to paper: Multi-resolution summaries can be generated and inverted to make book-scale generation learnable without introducing artifacts that degrade prose quality.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean, reality_from_one_distinction
  Tag: unclear (relation between the paper passage and the cited Recognition theorem).
  Passage: "We present a dataset construction and training framework for book-scale creative writing that reframes supervised fine-tuning as a prompt-to-book generation task grounded in human-authored fiction. Starting from public-domain novels, we derive a multi-resolution planning scaffold by summarizing each book at progressively finer levels, from a high-level premise to chapter- and scene-level structure."
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel
  Tag: unclear (relation between the paper passage and the cited Recognition theorem).
  Passage: "The model is trained to generate the planning scaffold and the original book text from a synthetic prompt, following a coarse-to-fine expansion process."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.