pith. machine review for the scientific record.

arxiv: 2604.11435 · v1 · submitted 2026-04-13 · 💻 cs.CL · cs.AI · cs.IR · cs.LG

Recognition: unknown

Think Before you Write: QA-Guided Reasoning for Character Descriptions in Books

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 15:00 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.IR · cs.LG
keywords character description generation · QA-guided reasoning · long-form narratives · narrative understanding · LLM reasoning decoupling · BookWorm dataset · CroSS dataset

The pith

Separating reasoning into a structured QA trace before writing character descriptions leads to more faithful outputs from long novels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Long novels pose a challenge for character description generation because models must track evolving attributes like relationships and events while integrating evidence scattered throughout the text. Direct application of built-in reasoning in language models can actually degrade performance on this task. The authors propose decoupling the process by training one model to produce a structured question-and-answer trace of relevant character details and then conditioning a second model on that trace to generate the description. Experiments on the BookWorm and CroSS datasets demonstrate gains in faithfulness, informativeness, and grounding compared with strong long-context baselines. The framework can be layered on top of existing long-context or chunk-based language models.
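
To make the decoupling concrete, here is a minimal Python sketch of the two-stage pipeline, assuming a generic text-completion callable; the prompts, model names, and trace format are illustrative assumptions, not the authors' implementation.

    # Two-stage pipeline: a reasoning model emits a QA trace, then a
    # separate generation model conditions on that trace.
    # `complete` stands in for any text-generation API (an assumption).

    def build_qa_trace(complete, book_text: str, character: str) -> str:
        # Stage 1: structured QA trace over attributes, relationships, events.
        prompt = (
            f"From the narrative below, write question-answer pairs about "
            f"{character}'s attributes, relationships, and key events, "
            f"citing where in the text each answer is grounded.\n\n{book_text}"
        )
        return complete(model="reasoning-model", prompt=prompt)

    def describe_character(complete, qa_trace: str, character: str) -> str:
        # Stage 2: final description conditioned only on the trace.
        prompt = (
            f"Using only the QA trace below, write a faithful description "
            f"of {character}.\n\nQA trace:\n{qa_trace}"
        )
        return complete(model="generation-model", prompt=prompt)

In this sketch the generator sees only the trace; whether the authors also pass source text alongside it is not specified in the abstract.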

Core claim

The central claim is that a training framework that decouples reasoning from generation, pairing a reasoning model that produces a structured QA reasoning trace with a generation model that conditions on that trace, yields character descriptions with improved faithfulness, informativeness, and grounding over strong long-context baselines on two book datasets.

What carries the argument

The structured QA reasoning trace, a set of questions and answers about character attributes, relationships, and events drawn from the narrative.
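
The abstract does not pin down the trace schema; one plausible shape, with assumed field names, is:

    from dataclasses import dataclass, field

    @dataclass
    class QAItem:
        question: str   # e.g. "Who raises the protagonist?"
        answer: str     # evidence-grounded answer
        evidence: list[str] = field(default_factory=list)  # passages or chapter refs

    # A trace is an ordered list of such items for a single character,
    # spanning attributes, relationships, and events.
    QATrace = list[QAItem]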

If this is right

  • Better tracking of how character attributes change across an entire narrative.
  • More effective integration of evidence that appears in different parts of the text.
  • Improved inference of implicit details not stated directly in the story.
  • The method can be applied on top of long-context LLMs or chunk-based processing pipelines; a chunked variant is sketched below.
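
For the chunk-based case, one natural composition, assumed here rather than taken from the paper, builds a partial trace per chunk and merges:

    def chunk(text: str, max_chars: int = 20_000) -> list[str]:
        # Fixed-size windows; a real pipeline would more likely split on
        # chapter or scene boundaries.
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

    def build_qa_trace_chunked(complete, book_text: str, character: str) -> str:
        # Reuses build_qa_trace from the pipeline sketch above.
        partial = [build_qa_trace(complete, piece, character)
                   for piece in chunk(book_text)]
        # Naive merge by concatenation; deduplicating or reconciling
        # conflicting answers would be an obvious refinement.
        return "\n".join(partial)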

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The QA trace could support interactive editing or querying of character profiles in writing assistance tools.
  • Similar decoupling of reasoning and generation may help with other long-document tasks such as plot summarization or event tracking.
  • Explicit reasoning structures might reduce hallucination in any narrative generation setting that involves complex entity evolution.

Load-bearing premise

The reasoning model can reliably produce a complete structured QA trace that captures evolving attributes and implicit details without introducing errors or missing key evidence.

What would settle it

Human or automatic evaluations on the BookWorm and CroSS datasets that rate the QA-guided descriptions as no more faithful or informative than those produced by direct long-context generation.

Figures

Figures reproduced from arXiv: 2604.11435 by Argyrios Papoudakis, Frank Keller, Mirella Lapata.

Figure 1: Overview of QA-based guided reasoning for … [figure image omitted]
Original abstract

Character description generation is an important capability for narrative-focused applications such as summarization, story analysis, and character-driven simulations. However, generating accurate character descriptions from long-form narratives (e.g., novels) is challenging: models must track evolving attributes (e.g., relationships and events), integrate evidence scattered across the text, and infer implicit details. Despite the success of reasoning-enabled LLMs on many benchmarks, we find that for character description generation their performance improves when built-in reasoning is disabled (i.e., an empty reasoning trace). Motivated by this, we propose a training framework that decouples reasoning from generation. Our approach, which can be applied on top of long-context LLMs or chunk-based methods, consists of a reasoning model that produces a structured QA reasoning trace and a generation model that conditions on this trace to produce the final character description. Experiments on two datasets (BookWorm and CroSS) show that QA-guided reasoning improves faithfulness, informativeness, and grounding over strong long-context baselines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes decoupling reasoning from generation for character description tasks in long narratives. Motivated by the observation that LLMs perform better on this task when built-in reasoning is disabled, the authors introduce a framework with a reasoning model that outputs a structured QA trace (capturing evolving attributes, scattered evidence, and implicit details) and a generation model that conditions on this trace. Experiments on the BookWorm and CroSS datasets report that this QA-guided approach improves faithfulness, informativeness, and grounding relative to strong long-context LLM baselines.

Significance. If the central claim holds, the work offers a practical, modular way to enhance reliability in narrative understanding applications such as summarization and character-driven simulation. The decoupling strategy is applicable atop existing long-context or chunk-based models and directly targets the challenges of attribute tracking and evidence integration. The empirical motivation from the built-in reasoning ablation is a clear strength, though the absence of direct validation for the reasoning component limits how strongly the results can be attributed to improved reasoning fidelity.

major comments (2)
  1. [§4 (Experiments)] The reported gains in faithfulness, informativeness, and grounding are presented as evidence that the QA-guided reasoning improves performance, yet no human evaluation, automatic metric, or gold-standard comparison is described that verifies the accuracy, completeness, or fidelity of the generated QA traces against the source narratives. Without this, it remains possible that improvements arise from structured formatting, reduced prompt length, or other artifacts rather than reliable reasoning, which is load-bearing for the central claim.
  2. [Abstract and §3 (Method)] The description of results lacks concrete details on the exact evaluation metrics for faithfulness/informativeness/grounding, the identities and configurations of the long-context baselines, and any statistical significance testing or variance reporting. This makes it difficult to assess the magnitude and robustness of the claimed improvements.
minor comments (2)
  1. The paper would benefit from including example QA traces (with source excerpts) in the main text or appendix to illustrate how the reasoning model captures evolving attributes and implicit details.
  2. Clarify whether the reasoning model is trained or prompted, and provide details on the QA template structure and how it is generated for each character.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive review and for recognizing the practical value of the decoupling approach. We address the two major comments below with targeted revisions to strengthen validation and clarity.

Point-by-point responses
  1. Referee: [§4 (Experiments)] The reported gains in faithfulness, informativeness, and grounding are presented as evidence that the QA-guided reasoning improves performance, yet no human evaluation, automatic metric, or gold-standard comparison is described that verifies the accuracy, completeness, or fidelity of the generated QA traces against the source narratives. Without this, it remains possible that improvements arise from structured formatting, reduced prompt length, or other artifacts rather than reliable reasoning, which is load-bearing for the central claim.

    Authors: We agree that the absence of direct evaluation of the QA traces leaves room for alternative explanations such as formatting effects. Our end-to-end results on BookWorm and CroSS demonstrate consistent gains, but to more rigorously attribute them to reasoning fidelity we will add a new analysis subsection in §4. This will include human ratings of trace accuracy, completeness, and grounding on a random sample of 100 examples per dataset, with inter-annotator agreement reported (a sketch of such an agreement statistic follows these responses). We will also compare against a control condition that uses only structured formatting without content-specific QA. revision: yes

  2. Referee: [Abstract and §3 (Method)] The description of results lacks concrete details on the exact evaluation metrics for faithfulness/informativeness/grounding, the identities and configurations of the long-context baselines, and any statistical significance testing or variance reporting. This makes it difficult to assess the magnitude and robustness of the claimed improvements.

    Authors: We acknowledge that the abstract and high-level method description could be more self-contained. Full metric definitions (human Likert scales for faithfulness and grounding plus any automatic proxies), baseline configurations (e.g., specific long-context models, context-window sizes, and prompting strategies), and statistical details (means, standard deviations, and significance tests) already appear in §4. In revision we will expand the abstract with one-sentence summaries of the metrics and baselines and add a short evaluation-setup paragraph to §3 that references the statistical reporting in the experiments section. revision: yes
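
The two responses promise inter-annotator agreement and significance testing without fixing procedures; Cohen's kappa and a paired bootstrap, sketched below, are common choices for each and are assumptions here, not the authors' stated methods.

    import random
    from collections import Counter

    def cohens_kappa(a: list[str], b: list[str]) -> float:
        # Chance-corrected agreement between two annotators' labels.
        n = len(a)
        observed = sum(x == y for x, y in zip(a, b)) / n
        ca, cb = Counter(a), Counter(b)
        expected = sum(ca[lab] * cb[lab] for lab in set(ca) | set(cb)) / n ** 2
        if expected == 1.0:  # both annotators used one identical label
            return 1.0
        return (observed - expected) / (1 - expected)

    def paired_bootstrap_p(ours: list[float], base: list[float],
                           iters: int = 10_000, seed: int = 0) -> float:
        # Approximate p-value for "our system beats the baseline" over
        # resampled per-example scores.
        rng = random.Random(seed)
        n, wins = len(ours), 0
        for _ in range(iters):
            idx = [rng.randrange(n) for _ in range(n)]
            if sum(ours[i] for i in idx) > sum(base[i] for i in idx):
                wins += 1
        return 1.0 - wins / iters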

Circularity Check

0 steps flagged

No circularity: empirical motivation and external evaluation

Full rationale

The paper begins with an empirical observation that disabling built-in reasoning improves character description performance, then introduces a decoupled QA-trace framework evaluated on held-out BookWorm and CroSS datasets against independent long-context baselines. No equations, parameters, or claims reduce by construction to the inputs; the reported gains in faithfulness and informativeness are measured externally rather than forced by definition or self-citation. The derivation chain remains self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework rests on the domain assumption that structured QA traces can effectively capture and organize narrative information for downstream generation, with no free parameters or invented entities explicitly introduced in the abstract.

axioms (1)
  • domain assumption LLMs can be trained to produce useful structured QA reasoning traces from long narratives
    Central to the proposed training framework and motivated by the observation that built-in reasoning hurts performance.

pith-pipeline@v0.9.0 · 5484 in / 1117 out tokens · 40509 ms · 2026-05-10T15:00:36.146251+00:00 · methodology

