
arxiv: 2604.13452 · v1 · submitted 2026-04-15 · 💻 cs.CL

CANVAS: Continuity-Aware Narratives via Visual Agentic Storyboarding

Pith reviewed 2026-05-10 13:51 UTC · model grok-4.3

classification 💻 cs.CL
keywords visual storytelling · storyboarding · multi-agent systems · continuity preservation · character consistency · background stability · generative models

The pith

CANVAS is a multi-agent framework that plans character consistency, background anchors, and location-aware transitions to preserve visual continuity across multiple shots in storyboards.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents CANVAS as a way to fix the problem that single-frame image generators produce strong individual images but lose coherence when strung into longer visual stories. It does this by breaking the task into specialized agents that track characters across shots, keep backgrounds stable with persistent anchors, and plan scene locations to make transitions smoother. This approach is tested on two existing storyboard benchmarks plus a new one designed for long-range consistency challenges. If the gains hold, the work shows that explicit multi-agent decomposition can address a core limitation in generative visual storytelling without relying solely on better base models.

Core claim

CANVAS enforces coherence through character continuity, persistent background anchors, and location-aware scene planning for smooth transitions within the same setting. When evaluated on ST-BENCH, ViStoryBench, and the new HardContinuityBench, it improves background continuity by 21.6 percent, character consistency by 9.6 percent, and props consistency by 7.6 percent over the strongest baseline.

What carries the argument

The multi-agent decomposition that assigns separate agents to enforce character continuity, maintain persistent background anchors, and perform location-aware scene planning for transitions.
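
The location-aware planning step can be illustrated with a minimal Python sketch. This is not the authors' implementation: in the paper the labels come from a prompted planner that reads the story, whereas the sketch below assigns them with a simple deterministic rule over an already-known sequence of locations. The label strings and the `location_id` field follow the JSON format quoted in the extracted prompt excerpts further down this page; `ShotPlan` and `plan_continuity` are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass

# Label strings as they appear in the quoted planner prompt.
FRESH = "fresh_location"
REAPPEAR = "location_reappearance"
CONTINUE = "previous_frame_continuation"


@dataclass
class ShotPlan:
    # Hypothetical container; the paper's planner returns JSON instead.
    shot_id: str
    location_id: str
    continuity_mode: str


def plan_continuity(shot_locations):
    """Label each shot given an ordered list of (shot_id, location_id) pairs.

    Assumed simplification: any shot in the same location as the shot
    immediately before it is treated as a continuation.
    """
    seen = set()
    previous = None
    plans = []
    for shot_id, location_id in shot_locations:
        if location_id == previous:
            mode = CONTINUE      # same environment as the previous shot
        elif location_id in seen:
            mode = REAPPEAR      # returns to an earlier environment
        else:
            mode = FRESH         # first time this environment is shown
        plans.append(ShotPlan(shot_id, location_id, mode))
        seen.add(location_id)
        previous = location_id
    return plans


story = [
    ("Scene_1_Shot_1", "museum_gallery"),
    ("Scene_1_Shot_2", "museum_gallery"),
    ("Scene_2_Shot_1", "security_room"),
    ("Scene_3_Shot_1", "museum_gallery"),
]
plans = plan_continuity(story)
modes = [p.continuity_mode for p in plans]
```

The quoted prompt also asks the planner to judge whether a same-location shot is an immediate temporal continuation or a later revisit, which depends on story content and is deliberately left out of this rule-based sketch.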

If this is right

  • Storyboard generation tools can achieve higher character, background, and prop consistency without retraining base image models.
  • A dedicated benchmark for long-range narrative consistency can be used to measure progress on extended visual sequences.
  • Explicit planning of scene locations and background anchors reduces abrupt shifts when the same setting appears across non-consecutive shots.
  • The same decomposition strategy can be applied to other multi-shot generative tasks such as comic creation or animation pre-visualization.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the agentic structure is the main source of gains, similar decomposition could improve continuity in video generation models that currently struggle with temporal coherence.
  • The method suggests that pre-planning anchors and locations may reduce post-generation editing effort in professional visual storytelling workflows.
  • Extending the framework to handle camera movement or lighting consistency could address additional failure modes not covered in the current benchmarks.

Load-bearing premise

The chosen automatic metrics and benchmarks accurately measure the continuity that humans actually perceive in visual stories.

What would settle it

A controlled human study in which participants rate storyboards from CANVAS as less continuous overall than those from the best baseline would falsify the central performance claim.

original abstract

Long-form visual storytelling requires maintaining continuity across shots, including consistent characters, stable environments, and smooth scene transitions. While existing generative models can produce strong individual frames, they fail to preserve such continuity, leading to appearance changes, inconsistent backgrounds, and abrupt scene shifts. We introduce CANVAS (Continuity-Aware Narratives via Visual Agentic Storyboarding), a multi-agent framework that explicitly plans visual continuity in multi-shot narratives. CANVAS enforces coherence through character continuity, persistent background anchors, and location-aware scene planning for smooth transitions within the same setting. We evaluate CANVAS on two storyboard generation benchmarks, ST-BENCH and ViStoryBench, and introduce a new challenging benchmark, HardContinuityBench, for long-range narrative consistency. CANVAS consistently outperforms the best-performing baseline, improving background continuity by 21.6%, character consistency by 9.6%, and props consistency by 7.6%.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces CANVAS, a multi-agent framework for visual storyboarding that explicitly plans character continuity, persistent background anchors, and location-aware transitions to improve coherence in long-form narratives. It evaluates the approach on ST-BENCH and ViStoryBench, introduces HardContinuityBench, and reports consistent outperformance over baselines with gains of 21.6% in background continuity, 9.6% in character consistency, and 7.6% in props consistency.

Significance. If the reported gains prove robust and attributable to the agentic architecture rather than prompt content or base model choice, CANVAS would offer a practical structured method for addressing continuity failures in generative visual storytelling, a core limitation of current models. The new HardContinuityBench could also serve as a useful resource for future work on long-range narrative consistency.

major comments (2)
  1. [§4] §4 (Experiments): The central claim attributes the 21.6%/9.6%/7.6% gains specifically to the multi-agent decomposition for continuity planning, yet no control is described that holds the underlying model and prompt content fixed while comparing the agentic structure against a monolithic prompt encoding identical continuity rules (character anchors, background persistence, location transitions). This control is load-bearing for the architecture-level conclusion.
  2. [§4] §4 and Abstract: The quantitative improvements are reported as single percentage deltas without error bars, number of runs, statistical tests, or ablation breakdowns (e.g., contribution of each agent), making it impossible to determine whether the gains exceed generation variance or result from post-hoc choices.
minor comments (2)
  1. [§3] §3 (Method): The description of how the three agents interact and pass information (e.g., exact message formats between character, background, and transition agents) could be expanded with pseudocode or a diagram for reproducibility.
  2. [§4.1] HardContinuityBench: Details on benchmark construction (number of stories, shot length distribution, annotation protocol) are needed to evaluate whether it genuinely stresses long-range consistency beyond existing datasets.
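
To make the second major comment concrete, the following Python sketch shows one common way to attach an interval to a score delta: a paired percentile bootstrap over per-story scores. It is an illustration of the kind of analysis the referee asks for, not something taken from the paper. The function name `paired_bootstrap_ci` is hypothetical, and the score lists are synthetic placeholders with no connection to the reported 21.6%/9.6%/7.6% figures.

```python
import random
from statistics import mean


def paired_bootstrap_ci(scores_a, scores_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean paired difference (a - b)."""
    if len(scores_a) != len(scores_b):
        raise ValueError("paired scores must have the same length")
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    resampled_means = []
    for _ in range(n_resamples):
        # Resample stories with replacement, keeping each pair together.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        resampled_means.append(mean(sample))
    resampled_means.sort()
    lo = resampled_means[int((alpha / 2) * n_resamples)]
    hi = resampled_means[int((1 - alpha / 2) * n_resamples) - 1]
    return mean(diffs), lo, hi


# Synthetic placeholder scores (one value per story), NOT results from the paper.
system_scores = [0.82, 0.75, 0.90, 0.68, 0.79, 0.85]
baseline_scores = [0.70, 0.71, 0.80, 0.66, 0.72, 0.78]

delta, ci_low, ci_high = paired_bootstrap_ci(system_scores, baseline_scores)
```

An interval that excludes zero would suggest the gain exceeds story-to-story variation under this resampling scheme; variance across repeated generations with different seeds would need separate runs, which the rebuttal commits to adding.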

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major point below and will revise the paper to incorporate the suggested controls and statistical details.

point-by-point responses
  1. Referee: [§4] §4 (Experiments): The central claim attributes the 21.6%/9.6%/7.6% gains specifically to the multi-agent decomposition for continuity planning, yet no control is described that holds the underlying model and prompt content fixed while comparing the agentic structure against a monolithic prompt encoding identical continuity rules (character anchors, background persistence, location transitions). This control is load-bearing for the architecture-level conclusion.

    Authors: We agree that a direct comparison holding the base model and prompt content fixed while varying only the agentic decomposition versus a monolithic prompt with identical continuity rules would more cleanly isolate the contribution of the multi-agent architecture. Our current baselines use a range of prompting and model configurations, but this specific control was not included. We will add the requested monolithic-prompt control experiment in the revised manuscript. revision: yes

  2. Referee: [§4] §4 and Abstract: The quantitative improvements are reported as single percentage deltas without error bars, number of runs, statistical tests, or ablation breakdowns (e.g., contribution of each agent), making it impossible to determine whether the gains exceed generation variance or result from post-hoc choices.

    Authors: We concur that error bars, the number of independent runs, statistical significance tests, and per-agent ablation breakdowns are necessary to assess robustness beyond single-run deltas. These elements were omitted from the initial submission. We will include multiple-run statistics, error bars, significance tests, and agent-wise ablations in the revised version. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical benchmark results with no derivation chain

full rationale

The paper introduces CANVAS as a multi-agent framework for storyboard generation and reports empirical improvements on continuity metrics across ST-BENCH, ViStoryBench, and the new HardContinuityBench. No equations, first-principles derivations, fitted parameters renamed as predictions, or self-referential definitions appear in the abstract or described evaluation setup. Central claims rest on direct benchmark comparisons rather than any reduction to inputs by construction, self-citation chains, or smuggled ansatzes, rendering the reported gains self-contained through experimental measurement.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No explicit free parameters, axioms, or invented physical entities are described in the abstract; the contribution is an engineering framework whose internal design choices are not detailed here.

pith-pipeline@v0.9.0 · 5469 in / 1065 out tokens · 23521 ms · 2026-05-10T13:51:07.471632+00:00 · methodology


Reference graph

Works this paper leans on

46 extracted references · 46 canonical work pages

  1. [1]

    We therefore evaluate whether characters preserve consistent identity cues when they reappear in different frames

    Character Consistency (CharCons). Characters are the central entities of most narratives, and maintaining stable identity across shots is critical for visual storytelling. We therefore evaluate whether characters preserve consistent identity cues when they reappear in different frames. This dimension is decomposed into three components: Facial Consistency (FaceCons), w...

  2. [2]

    To capture such spatial errors, we evaluate whether the non-movable architectural elements of the environment remain geometrically coherent across shots

    Non-Movable Object Consistency (GeomCons). Even when characters remain consistent, storyboard frames may violate the physical structure of the environment: for example, walls, windows, or pillars may suddenly appear or disappear between shots. To capture such spatial errors, we evaluate whether the non-movable architectural elements of the environment remain geometric...

  3. [3]

    Generative models often introduce inconsistencies where props suddenly appear, disappear, or change across shots

    Movable Object Consistency (PropCons). Storyboard continuity also requires maintaining the identity and placement of movable objects within the scene. Generative models often introduce inconsistencies where props suddenly appear, disappear, or change across shots. To measure this effect, we evaluate whether movable objects (props) remain consistent across f...

  4. [4]

    Identify the physical environment in which each shot occurs

  5. [5]

    Assign a unique location_id to each distinct location

  6. [6]

    Cluster shots that occur in the same environment under the same location_id

  7. [7]

    If a shot returns to a previously seen environment, mark it as location_reappearance

  8. [8]

    If a shot directly continues the previous scene without changing environment, mark it as previous_frame_continuation

  9. [9]

    locations

    If the location appears for the first time, mark it as fresh_location. Return STRICT JSON in the following format: { "locations": { "museum_gallery": [ "Scene_1_Shot_1", "Scene_1_Shot_2", "Scene_3_Shot_1" ], "security_room": [ "Scene_2_Shot_1" ] }, "shots": { "Scene_1_Shot_1": { "location_id": "museum_gallery", "continuity_mode": "fresh_location" } } } Con...

  10. [10]

    Which props should remain visible in the background environment

  11. [11]

    Which props should disappear because they were taken, destroyed, or moved

  12. [12]

    Which props are likely to appear in the background due to upcoming story events

  13. [13]

    Which props are currently carried by characters and therefore should not remain in the environment

  14. [14]

    Which environmental objects must persist across shots to maintain scene identity (e.g., furniture, structures, display cases)

  15. [15]

    background_props

    Whether the camera framing hides some props even though they still exist in the environment. Return STRICT JSON in the following format: { "background_props": [ "display_case", "museum_sign", "security_camera" ], "must_appear": [ "golden_artifact" ], "must_not_appear": [ "artifact_case_glass" ], "carried_props": { "thief": ["golden_artifact"] }, "reasonin...

  16. [16]

    For each shot, determine whether the target character appears in the scene

  17. [17]

    If the character appears, determine the character’s visual appearance state (e.g., clothing, uniform, disguise)

  18. [18]

    If the appearance does not change, reuse the same appearance state identifier as the previous shot

  19. [19]

    If the story explicitly changes the character’s clothing or visual style, assign a new appearance state

  20. [20]

    character

    If the character does not appear in a shot, mark the state as not_present. Return STRICT JSON in the following format: { "character": "Ethan", "appearance_timeline": { "Scene_1_Shot_1": "blue_curator_jacket", "Scene_1_Shot_2": "blue_curator_jacket", "Scene_1_Shot_3": "blue_curator_jacket", "Scene_2_Shot_1": "black_security_jacket", "Scene_2_Shot_2": "not_p...

  21. [21]

    For each shot, determine whether the target prop appears in the scene

  22. [22]

    If the prop appears, determine its state (e.g., intact, broken, inside container, carried by character, missing)

  23. [23]

    If the prop is carried by a character, record which character carries it

  24. [24]

    If the prop is removed from the scene (e.g., stolen, destroyed, moved), update its state accordingly

  25. [25]

    If the prop continues unchanged across shots, reuse the same state identifier

  26. [26]

    prop

    If the prop does not appear in a shot but still exists in the story world, mark it as not_visible. Return STRICT JSON in the following format: { "prop": "golden_artifact", "state_timeline": { "Scene_1_Shot_1": { "state": "inside_display_case", "carrier": null }, "Scene_1_Shot_2": { "state": "inside_display_case", "carrier": null }, "Scene_1_Shot_3": { "sta...

  27. [27]

    Whether the previous shot and current shot occur in the same physical environment

  28. [28]

    Whether the current shot is an immediate temporal continuation rather than a later revisit

  29. [29]

    Whether the spatial arrangement of the scene should remain consistent across the cut

  30. [30]

    Whether important background geometry must be preserved, such as walls, doors, furniture, display cases, tables, platforms, or room layout

  31. [31]

    Whether character positions, object placements, or scene composition depend on the previous frame

  32. [32]

    Whether the current shot is a close-up, zoom-in, alternate camera angle, or tighter crop of the same ongoing scene

  33. [33]

    continuity_mode

    Whether there are state changes in the scene that still require maintaining the same base spatial environment. Use the following rules: • Choose previous_frame_continuation if the scene is still unfolding in the same environment and preserving the exact spatial setup from the previous frame is important. • Choose location_reappearance if the shot returns ...

  34. [34]

    Initialize anchor set: $A_t \leftarrow \emptyset$

  35. [35]

    (a) If the planner determines that spatial dependencies from the previous frame must be preserved: i

    Retrieve shot plan from global planner: $p_t = P(s_t)$. 3. Continuation Reasoning: Determine whether the current shot is a direct continuation of the previous shot. (a) If the planner determines that spatial dependencies from the previous frame must be preserved: i. Add most recent frame anchor: $A_t \leftarrow A_t \cup \{I_{t-1}\}$. ii. Proceed to character anchor retrieval. 4.Cha...

  36. [36]

    Your task is to generate a candidate frame for the current shot while maintaining visual continuity with previously generated frames and respecting the global story plan

    Return $(A_t, \Pi_t)$. Prompt: Candidate Frame Generation. You are a cinematic storyboard generator. Your task is to generate a candidate frame for the current shot while maintaining visual continuity with previously generated frames and respecting the global story plan. Input • Shot descrip...

  37. [37]

    Retrieve shot plan from global planner: $p_t = P(s_t)$

  38. [38]

    Extract continuity constraints from the global plan: • Character appearance states $C(c, t)$ • Location identity $L(s_t)$ • Prop states $O(o, t)$

  39. [39]

    Construct the VLM evaluation prompt that includes: • Shot description $s_t$ • Character appearance states $C(c, t)$ • Prop states $O(o, t)$ • Location identifier $L(s_t)$ • Previous frame $I_{t-1}$ (if continuation)

  40. [40]

    Initialize scores: $\mathrm{score}_i = 0 \;\; \forall i \in \{1, \ldots, N\}$. 5. Evaluate candidate frames: For each candidate image $I_t^{(i)}$: (a) Provide the candidate image and the evaluation prompt to the VLM. (b) The VLM evaluates the frame using the scoring rubric and returns: • Shot alignment score • Character consistency score • Background continuity score • Prop state correctness score (c...

  41. [41]

    Select best candidate: $I_t^{*} = \arg\max_i \mathrm{score}_i$

  42. [42]

    shot_alignment

    Return selected frame $I_t^{*}$. Prompt: VLM-Based Continuity Scoring Prompt. You are a visual continuity evaluator. Your task is to evaluate a candidate storyboard frame and assign scores based on how well it satisfies the global continuity plan and the current shot description. Input • Curre...

  43. [43]

    Query the global plan to obtain: • Character appearance states $C(c, t)$ • Location identity $L(s_t)$ • Object states $O(o, t)$

  44. [44]

    Provide the selected frame $I_t^{*}$ together with the shot description and entity descriptions to a VLM-based extractor (prompt in Table ??)

  45. [45]

    5. Update Location Memory: Let $l = L(s_t)$

    The VLM identifies visible entities and produces clean visual anchors: • Character anchors • Location/background anchor • Prop anchors. 4. Update Character Memory: For each detected character $c$: $M_c(c, C(c, t)) \leftarrow \mathrm{anchor}_c$, where $\mathrm{anchor}_c$ is the extracted character image representing the current appearance. 5. Update Location Memory: Let $l = L(s_t)$. $M_l(l) \leftarrow \mathrm{anchor}_l$ whe...

  46. [46]

    characters

    Return updated memory $M$. Prompt: Character Anchor Extraction. You are extracting visual anchors for characters from a storyboard frame. Input • Frame image $I_t$ • Shot description $s_t$ • Character appearance states $C(c, t)$. Task: Identify characters visible in the frame and extract a clean visua...
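
The candidate-selection and anchor-assembly fragments in the excerpts above can be sketched in Python as follows. This is a hedged illustration, not the paper's code. The four rubric dimensions are the ones named in the excerpt, but the way they are combined into a single score is truncated there, so the unweighted sum used here is an assumption. `RubricScores`, `select_best`, and `build_anchor_set` are hypothetical names, and only the previous-frame step of anchor retrieval is reproduced because the remaining steps are cut off in the excerpt.

```python
from dataclasses import dataclass


@dataclass
class RubricScores:
    # The four scores the VLM evaluator is described as returning.
    shot_alignment: float
    character_consistency: float
    background_continuity: float
    prop_state_correctness: float

    def total(self):
        # ASSUMPTION: unweighted sum; the actual aggregation is not shown.
        return (self.shot_alignment + self.character_consistency
                + self.background_continuity + self.prop_state_correctness)


def select_best(candidates):
    """Return the id of the candidate with the highest total score.

    `candidates` is a list of (frame_id, RubricScores); ties go to the
    earliest candidate because max() keeps the first maximum it sees.
    """
    if not candidates:
        raise ValueError("at least one candidate frame is required")
    best_id, _ = max(candidates, key=lambda pair: pair[1].total())
    return best_id


def build_anchor_set(continuity_mode, previous_frame):
    """Start with an empty anchor set and add the previous frame if needed."""
    anchors = []
    if continuity_mode == "previous_frame_continuation" and previous_frame is not None:
        anchors.append(previous_frame)  # corresponds to adding I_{t-1}
    return anchors


# Hypothetical scores on an arbitrary scale, for illustration only.
candidates = [
    ("cand_0", RubricScores(4, 3, 2, 4)),
    ("cand_1", RubricScores(4, 4, 4, 3)),
    ("cand_2", RubricScores(5, 2, 3, 3)),
]
best = select_best(candidates)
anchors_cont = build_anchor_set("previous_frame_continuation", "frame_prev.png")
anchors_fresh = build_anchor_set("fresh_location", "frame_prev.png")
```

Keeping the aggregation in one method makes it easy to swap in a weighted rule if the full paper specifies one.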