STAGE: A Full-Screenplay Benchmark for Reasoning over Evolving Stories
Pith reviewed 2026-05-16 14:36 UTC · model grok-4.3
The pith
The STAGE benchmark tests whether models can build and maintain a coherent story world across full-length movie screenplays, using four linked tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
STAGE defines four tasks—knowledge graph construction, scene-level event summarization, long-context screenplay question answering, and in-script character role-playing—all grounded in a shared narrative world representation. The benchmark supplies cleaned scripts, curated knowledge graphs, and event- and character-centric annotations for 150 films across English and Chinese, enabling holistic evaluation of models' abilities to build world representations, abstract and verify narrative events, reason over long narratives, and generate character-consistent responses.
What carries the argument
The shared narrative world representation realized as curated knowledge graphs that all four tasks operate over.
If this is right
- Models must demonstrate cross-task consistency on the same story facts to score well.
- Long-context question answering is evaluated only after graph construction and event summarization steps.
- Character role-playing responses are checked against the same knowledge graph used for summarization.
- The benchmark covers both English and Chinese scripts, allowing direct comparison of cross-lingual narrative reasoning.
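To make the cross-task consistency idea concrete, here is a minimal sketch of how claims produced by different tasks could be checked against one shared graph. It is an illustration only, not the benchmark's scoring code: the `Triple` type, the `consistency_rate` function, the relation names, and the toy story are all hypothetical and do not come from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Triple:
    """One story fact: (subject, relation, object). Hypothetical schema."""
    subject: str
    relation: str
    obj: str


def consistency_rate(graph: set, claims: list) -> float:
    """Fraction of claimed triples that appear in the reference graph.

    An empty claim list is treated as trivially consistent.
    """
    if not claims:
        return 1.0
    supported = sum(1 for claim in claims if claim in graph)
    return supported / len(claims)


# Toy reference graph for an invented story (not taken from STAGE).
graph = {
    Triple("Ana", "sibling_of", "Ben"),
    Triple("Ben", "works_for", "Cora"),
}

# Facts implied by two different task outputs, checked against the SAME graph.
summary_claims = [Triple("Ana", "sibling_of", "Ben")]
roleplay_claims = [
    Triple("Ben", "works_for", "Cora"),
    Triple("Ana", "works_for", "Cora"),  # not in the graph: an inconsistency
]

summary_rate = consistency_rate(graph, summary_claims)
roleplay_rate = consistency_rate(graph, roleplay_claims)
print(summary_rate, roleplay_rate)
```

Exact triple matching is the simplest possible choice; a real evaluation would likely need entity normalization and some tolerance for paraphrased relations, neither of which is described in the text reviewed here.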
Where Pith is reading between the lines
- Success on STAGE would suggest models can serve as reliable story-world engines for interactive fiction or script analysis tools.
- Failure patterns could reveal whether current long-context architectures lose track of character relationships over hundreds of pages.
- The design could be extended to television series or novels by reusing the same graph-plus-tasks structure.
Load-bearing premise
The curated knowledge graphs, event annotations, and character labels for the 150 films accurately and comprehensively capture the narrative elements needed to evaluate true story-world consistency.
What would settle it
A model that scores highly on all four tasks while using only raw screenplay text and without ever producing or consulting the provided knowledge graphs would falsify the claim that the shared representation is necessary for consistent reasoning.
Original abstract
Movie screenplays are rich long-form narratives that interleave complex character relationships, temporally ordered events, and dialogue-driven interactions. While prior benchmarks target individual subtasks such as question answering or dialogue generation, they rarely evaluate whether models can construct a coherent story world and use it consistently across multiple forms of reasoning and generation. We introduce STAGE (Screenplay Text, Agents, Graphs and Evaluation), a unified benchmark for narrative understanding over full-length movie screenplays. STAGE defines four tasks: knowledge graph construction, scene-level event summarization, long-context screenplay question answering, and in-script character role-playing, all grounded in a shared narrative world representation. The benchmark provides cleaned scripts, curated knowledge graphs, and event- and character-centric annotations for 150 films across English and Chinese, enabling holistic evaluation of models' abilities to build world representations, abstract and verify narrative events, reason over long narratives, and generate character-consistent responses.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces STAGE, a unified benchmark for narrative understanding over full-length movie screenplays. It defines four tasks—knowledge graph construction, scene-level event summarization, long-context screenplay question answering, and in-script character role-playing—all grounded in a shared narrative world representation, and provides cleaned scripts, curated knowledge graphs, and event- and character-centric annotations for 150 films across English and Chinese.
Significance. If the annotations prove accurate and comprehensive, STAGE would offer a valuable holistic evaluation framework that moves beyond isolated subtasks to assess models' abilities to build consistent story worlds, abstract events, reason over long contexts, and maintain character consistency, filling a gap in existing narrative benchmarks.
major comments (1)
- [Abstract] Abstract: The central claim that the four tasks enable reliable measurement of world-building, event abstraction, long-context reasoning, and character consistency depends on the curated knowledge graphs, scene-level event summaries, and character-centric labels accurately encoding the screenplay content. However, the manuscript provides no details on the curation protocol, inter-annotator agreement, validation against source scripts, or coverage of plot branches and implicit relations, leaving the ground truth quality unverified and the holistic evaluation claim unsupported.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback emphasizing the need for transparency in our annotation and curation processes. We agree that documenting these details is critical to substantiate the benchmark's claims and will revise the manuscript accordingly to address this concern directly.
Point-by-point responses
- Referee: [Abstract] Abstract: The central claim that the four tasks enable reliable measurement of world-building, event abstraction, long-context reasoning, and character consistency depends on the curated knowledge graphs, scene-level event summaries, and character-centric labels accurately encoding the screenplay content. However, the manuscript provides no details on the curation protocol, inter-annotator agreement, validation against source scripts, or coverage of plot branches and implicit relations, leaving the ground truth quality unverified and the holistic evaluation claim unsupported.
Authors: We acknowledge that the current version of the manuscript does not provide sufficient detail on the annotation curation protocol, inter-annotator agreement, validation steps, or handling of plot branches and implicit relations. In the revised manuscript we will add a dedicated subsection (3.2 Annotation Protocol) that describes: the multi-stage process involving trained annotators and script experts; inter-annotator agreement metrics (Cohen's kappa and Fleiss' kappa reported for event summaries and character labels); validation procedures including direct cross-referencing with source scripts and discrepancy resolution; and our coverage strategy for plot branches and implicit relations, which focuses on explicit narrative elements plus key inferences supported by dialogue, with quantitative coverage statistics included. These additions will directly support the abstract's claims regarding reliable measurement of the four capabilities.
Revision: yes
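For readers unfamiliar with the agreement metric named in the rebuttal, the following is a minimal sketch of Cohen's kappa for two annotators, computed as (observed agreement - chance agreement) / (1 - chance agreement). The function name, the label values, and the sample annotations are hypothetical; nothing here reflects how the authors would actually compute or report agreement.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)

    # Observed agreement: share of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: probability both pick the same label independently,
    # estimated from each annotator's own label frequencies.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)

    if expected == 1.0:
        # Both annotators used a single identical label for every item.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical character-role labels from two annotators on four characters.
annotator_1 = ["hero", "hero", "villain", "villain"]
annotator_2 = ["hero", "villain", "villain", "villain"]

kappa = cohens_kappa(annotator_1, annotator_2)
print(kappa)
```

Cohen's kappa covers exactly two annotators; Fleiss' kappa, also mentioned in the rebuttal, generalizes the same chance-corrected idea to more than two and is not shown here.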
Circularity Check
No significant circularity; benchmark creation is self-contained
Full rationale
The paper introduces STAGE as a new benchmark with four tasks grounded in curated resources for 150 films. No equations, fitted parameters, predictions, or derivations are present that could reduce to inputs by construction. The contribution consists of cleaned scripts, knowledge graphs, event summaries, and character annotations; these are presented as externally curated artifacts rather than outputs of any self-referential process. No self-citation load-bearing steps, uniqueness theorems, or ansatzes appear in the provided text. This is a standard dataset-release paper whose claims rest on the independent value of the released resources, not on any internal reduction.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Narrative understanding over evolving stories can be adequately evaluated through the four tasks of knowledge graph construction, scene summarization, long-context QA, and character role-playing grounded in a shared world representation.