STAGE: A Full-Screenplay Benchmark for Reasoning over Evolving Stories

Fan Guo; Fengyi Chen; Jinjing Shen; Qiuyu Tian; Xin Zhang; Yiding Li; Yingce Xia; Yiyun Luo; Youyong Kong; Yuyao Li

arxiv: 2601.08510 · v4 · submitted 2026-01-13 · 💻 cs.CL · cs.AI

Qiuyu Tian, Zequn Liu, Yiding Li, Fengyi Chen, Youyong Kong, Fan Guo, Yuyao Li, Jinjing Shen, Zhijing Xie, Yiyun Luo, Xin Zhang, Yingce Xia

Pith reviewed 2026-05-16 14:36 UTC · model grok-4.3

classification 💻 cs.CL cs.AI

keywords: screenplay benchmark, narrative understanding, knowledge graph construction, event summarization, long-context question answering, character role-playing, story world representation, movie script analysis

The pith

The STAGE benchmark tests whether models can build and maintain a coherent story world across full movie screenplays, using four linked tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces STAGE as a benchmark that evaluates language models on narrative reasoning over complete movie screenplays rather than isolated subtasks. It supplies cleaned scripts, knowledge graphs, event annotations, and character labels for 150 films in English and Chinese. The four tasks—knowledge graph construction, scene-level event summarization, long-context question answering, and character role-playing—share one underlying narrative world representation. This setup lets researchers measure whether models can construct consistent worlds, verify events, handle long contexts, and produce character-consistent output in a single narrative domain. A sympathetic reader would care because existing benchmarks rarely check if models keep the same story facts straight when switching between building, summarizing, questioning, and generating.

Core claim

STAGE defines four tasks—knowledge graph construction, scene-level event summarization, long-context screenplay question answering, and in-script character role-playing—all grounded in a shared narrative world representation. The benchmark supplies cleaned scripts, curated knowledge graphs, and event- and character-centric annotations for 150 films across English and Chinese, enabling holistic evaluation of models' abilities to build world representations, abstract and verify narrative events, reason over long narratives, and generate character-consistent responses.

What carries the argument

The shared narrative world representation realized as curated knowledge graphs that all four tasks operate over.
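To make the idea of a shared representation concrete, here is a minimal sketch of a story-world graph whose facts are stamped with the scene that establishes them. The abstract does not describe the benchmark's schema, so the `Fact` and `StoryWorld` classes, the relation names, and the characters are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Fact:
    """One edge of a hypothetical story-world graph."""
    subject: str
    relation: str
    obj: str
    scene: int  # index of the scene in which the fact is established


@dataclass
class StoryWorld:
    """Illustrative container; the benchmark's real schema may differ."""
    facts: list[Fact] = field(default_factory=list)

    def add(self, subject: str, relation: str, obj: str, scene: int) -> None:
        self.facts.append(Fact(subject, relation, obj, scene))

    def known_by(self, scene: int) -> list[Fact]:
        """Facts established at or before the given scene."""
        return [f for f in self.facts if f.scene <= scene]

    def about(self, character: str, scene: int) -> list[Fact]:
        """Facts mentioning a character, as of the given scene."""
        return [f for f in self.known_by(scene)
                if character in (f.subject, f.obj)]


# Invented example data, not taken from any film in the benchmark.
world = StoryWorld()
world.add("Ana", "sibling_of", "Ben", scene=1)
world.add("Ana", "distrusts", "Carl", scene=4)
world.add("Ben", "works_for", "Carl", scene=7)
```

Scoping facts by scene is one way to model an evolving story: a summarizer, a question answerer, and a role-playing agent could each query `about(character, scene)` and see the same state of the world at the same point in the script.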

If this is right

Models must demonstrate cross-task consistency on the same story facts to score well.
Long-context question answering is evaluated only after graph construction and event summarization steps.
Character role-playing responses are checked against the same knowledge graph used for summarization.
The benchmark covers both English and Chinese scripts, allowing direct comparison of cross-lingual narrative reasoning.
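One way to picture the cross-task consistency implied by these points is to score each task's output against the same gold graph. The sketch below assumes every task output has already been reduced to (subject, relation, object) claims, which is itself the hard part and is not something the abstract describes; the function names and data are hypothetical.

```python
Triple = tuple[str, str, str]


def support_rate(claims: list[Triple], gold: set[Triple]) -> float:
    """Fraction of claims that appear in the gold graph (0.0 if no claims)."""
    if not claims:
        return 0.0
    return sum(c in gold for c in claims) / len(claims)


def cross_task_report(outputs: dict[str, list[Triple]],
                      gold: set[Triple]) -> dict[str, float]:
    """Score every task's claims against one shared gold graph."""
    return {task: support_rate(claims, gold)
            for task, claims in outputs.items()}


# Invented example data: one gold graph, three task outputs.
gold = {("Ana", "sibling_of", "Ben"), ("Ben", "works_for", "Carl")}
outputs = {
    "summarization": [("Ana", "sibling_of", "Ben")],
    "question_answering": [("Ben", "works_for", "Carl"),
                           ("Ana", "works_for", "Carl")],
    "role_playing": [("Ana", "sibling_of", "Ben"),
                     ("Ben", "works_for", "Carl")],
}
report = cross_task_report(outputs, gold)
```

In this toy case the question-answering output asserts one unsupported fact, so its support rate drops to 0.5 while the other two tasks stay at 1.0. A gap like that between tasks is the kind of inconsistency a shared representation makes visible.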

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Success on STAGE would suggest models can serve as reliable story-world engines for interactive fiction or script analysis tools.
Failure patterns could reveal whether current long-context architectures lose track of character relationships over hundreds of pages.
The design could be extended to television series or novels by reusing the same graph-plus-tasks structure.

Load-bearing premise

The curated knowledge graphs, event annotations, and character labels for the 150 films accurately and comprehensively capture the narrative elements needed to evaluate true story-world consistency.

What would settle it

A model that scores highly on all four tasks while using only raw screenplay text and without ever producing or consulting the provided knowledge graphs would falsify the claim that the shared representation is necessary for consistent reasoning.

Original abstract

Movie screenplays are rich long-form narratives that interleave complex character relationships, temporally ordered events, and dialogue-driven interactions. While prior benchmarks target individual subtasks such as question answering or dialogue generation, they rarely evaluate whether models can construct a coherent story world and use it consistently across multiple forms of reasoning and generation. We introduce STAGE (Screenplay Text, Agents, Graphs and Evaluation), a unified benchmark for narrative understanding over full-length movie screenplays. STAGE defines four tasks: knowledge graph construction, scene-level event summarization, long-context screenplay question answering, and in-script character role-playing, all grounded in a shared narrative world representation. The benchmark provides cleaned scripts, curated knowledge graphs, and event- and character-centric annotations for 150 films across English and Chinese, enabling holistic evaluation of models' abilities to build world representations, abstract and verify narrative events, reason over long narratives, and generate character-consistent responses.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

STAGE puts together a multi-task benchmark on full screenplays with linked tasks, but its value rests on annotation quality that the abstract leaves unexamined.

STAGE assembles a benchmark covering entire movie screenplays and ties four tasks together: knowledge graph construction, scene-level event summarization, long-context question answering, and character role-playing, all meant to draw on one shared narrative world. This is a step beyond the single-subtask setups that currently dominate the area, and the release of cleaned full scripts plus the derived graphs and labels for 150 bilingual films gives it practical scale.

Referee Report

1 major / 0 minor

Summary. The paper introduces STAGE, a unified benchmark for narrative understanding over full-length movie screenplays. It defines four tasks—knowledge graph construction, scene-level event summarization, long-context screenplay question answering, and in-script character role-playing—all grounded in a shared narrative world representation, and provides cleaned scripts, curated knowledge graphs, and event- and character-centric annotations for 150 films across English and Chinese.

Significance. If the annotations prove accurate and comprehensive, STAGE would offer a valuable holistic evaluation framework that moves beyond isolated subtasks to assess models' abilities to build consistent story worlds, abstract events, reason over long contexts, and maintain character consistency, filling a gap in existing narrative benchmarks.

major comments (1)

[Abstract] The central claim that the four tasks enable reliable measurement of world-building, event abstraction, long-context reasoning, and character consistency depends on the curated knowledge graphs, scene-level event summaries, and character-centric labels accurately encoding the screenplay content. However, the manuscript provides no details on the curation protocol, inter-annotator agreement, validation against source scripts, or coverage of plot branches and implicit relations, leaving the ground-truth quality unverified and the holistic evaluation claim unsupported.
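As an illustration of what "validation against source scripts" could look like at its crudest, the sketch below flags graph triples whose subject or object never appears in the scene they cite. This is an assumed spot-check, not a procedure from the paper; the function name, the tuple layout, and the sample scenes are hypothetical.

```python
GroundedTriple = tuple[str, str, str, int]  # (subject, relation, object, scene id)


def ungrounded_triples(triples: list[GroundedTriple],
                       scenes: dict[int, str]) -> list[GroundedTriple]:
    """Return triples whose subject or object is absent from the cited scene text.

    Matching is a case-insensitive substring test, so this only queues
    candidates for human review; it cannot confirm that a relation holds.
    """
    flagged = []
    for subject, relation, obj, scene_id in triples:
        text = scenes.get(scene_id, "").lower()
        if subject.lower() not in text or obj.lower() not in text:
            flagged.append((subject, relation, obj, scene_id))
    return flagged


# Invented example data.
scenes = {
    1: "INT. KITCHEN - DAY. ANA pours coffee for her brother BEN.",
    2: "EXT. DOCKS - NIGHT. BEN waits alone.",
}
triples = [
    ("Ana", "sibling_of", "Ben", 1),
    ("Ben", "works_for", "Carl", 2),
]
flagged = ungrounded_triples(triples, scenes)
```

A check this simple will also flag legitimate implicit relations, where a character is referred to but not named in the scene. That is exactly the coverage question the referee raises, so its output is better read as a review queue than as an error list.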

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback emphasizing the need for transparency in our annotation and curation processes. We agree that documenting these details is critical to substantiate the benchmark's claims and will revise the manuscript accordingly to address this concern directly.

Referee: [Abstract] The central claim that the four tasks enable reliable measurement of world-building, event abstraction, long-context reasoning, and character consistency depends on the curated knowledge graphs, scene-level event summaries, and character-centric labels accurately encoding the screenplay content. However, the manuscript provides no details on the curation protocol, inter-annotator agreement, validation against source scripts, or coverage of plot branches and implicit relations, leaving the ground-truth quality unverified and the holistic evaluation claim unsupported.

Authors: We acknowledge that the current version of the manuscript does not provide sufficient detail on the annotation curation protocol, inter-annotator agreement, validation steps, or handling of plot branches and implicit relations. In the revised manuscript we will add a dedicated subsection (3.2 Annotation Protocol) that describes: the multi-stage process involving trained annotators and script experts; inter-annotator agreement metrics (Cohen's kappa and Fleiss' kappa reported for event summaries and character labels); validation procedures including direct cross-referencing with source scripts and discrepancy resolution; and our coverage strategy for plot branches and implicit relations, which focuses on explicit narrative elements plus key inferences supported by dialogue, with quantitative coverage statistics included. These additions will directly support the abstract's claims regarding reliable measurement of the four capabilities.

Revision: yes
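For readers unfamiliar with the agreement metric named in the response, Cohen's kappa for two annotators is (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e is the agreement expected by chance from each annotator's label frequencies. The sketch below computes it from scratch; the character labels are invented for illustration and are not the benchmark's label set.

```python
from collections import Counter


def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        # Both annotators used one identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical character-role labels from two annotators.
annotator_a = ["hero", "hero", "villain", "villain"]
annotator_b = ["hero", "villain", "villain", "villain"]
kappa = cohens_kappa(annotator_a, annotator_b)
```

Here the annotators agree on three of four items (p_o = 0.75) while chance agreement is p_e = 0.5, giving a kappa of 0.5. Fleiss' kappa, also named in the response, generalizes the same idea to more than two annotators.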

Circularity Check

0 steps flagged

No significant circularity; benchmark creation is self-contained

The paper introduces STAGE as a new benchmark with four tasks grounded in curated resources for 150 films. No equations, fitted parameters, predictions, or derivations are present that could reduce to inputs by construction. The contribution consists of cleaned scripts, knowledge graphs, event summaries, and character annotations; these are presented as externally curated artifacts rather than outputs of any self-referential process. No self-citation load-bearing steps, uniqueness theorems, or ansatzes appear in the provided text. This is a standard dataset-release paper whose claims rest on the independent value of the released resources, not on any internal reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper introduces a benchmark rather than a theoretical derivation, so the ledger contains only domain assumptions about what constitutes narrative understanding; no free parameters or invented entities are described.

axioms (1)

Domain assumption: Narrative understanding over evolving stories can be adequately evaluated through the four tasks of knowledge graph construction, scene summarization, long-context QA, and character role-playing grounded in a shared world representation.
The benchmark design rests on this decomposition of narrative reasoning into the listed tasks.

pith-pipeline@v0.9.0 · 5493 in / 1218 out tokens · 91734 ms · 2026-05-16T14:36:07.554387+00:00 · methodology
