pith. machine review for the scientific record.

arxiv: 2604.03315 · v1 · submitted 2026-04-01 · 💻 cs.CV · cs.AI

Recognition: 2 Lean theorem links

StoryBlender: Inter-Shot Consistent and Editable 3D Storyboard with Spatial-temporal Dynamics

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 23:21 UTC · model grok-4.3

classification: 💻 cs.CV · cs.AI
keywords: 3D storyboard generation · inter-shot consistency · editable 3D scenes · multi-agent verification · spatial-temporal dynamics · continuity memory graph · story-centric reflection

The pith

StoryBlender builds native 3D storyboards that stay consistent across shots while allowing direct edits to cameras and assets.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents StoryBlender as a framework that turns text or story descriptions into 3D scenes usable for film pre-visualization. It separates global story elements from per-shot details through a continuity memory graph, places every character and object into one shared coordinate system, and then evolves camera and layout choices according to explicit visual metrics. A multi-agent loop checks outputs against a 3D engine and fixes spatial errors before they accumulate. The end result is a single 3D environment that can be tweaked shot by shot without breaking visual identity or continuity. This matters because current 2D generators drift in appearance and traditional 3D tools demand too much manual labor.

Core claim

StoryBlender is a three-stage pipeline governed by a Story-centric Reflection Scheme. Semantic-Spatial Grounding builds a continuity memory graph that decouples shared assets from shot variables. Canonical Asset Materialization places every entity into a single coordinate frame to lock visual identity. Spatial-Temporal Dynamics then uses visual metrics to design layouts and camera moves. A hierarchical agent verification loop feeds engine-rendered checks back into the system, allowing it to self-correct spatial hallucinations while preserving long-horizon multi-shot continuity in the final native 3D scenes.
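
The pipeline's shape can be made concrete with a minimal Python sketch. Everything here is hypothetical scaffolding: Asset, Shot, ContinuityGraph, and the three stage functions are illustrative names, not the paper's API; the sketch only shows how shared assets and per-shot variables would be kept apart.

```python
from dataclasses import dataclass, field

@dataclass
class Asset:
    name: str
    description: str                              # global identity, shared across shots

@dataclass
class Shot:
    index: int
    action: str
    camera: dict = field(default_factory=dict)    # shot-specific variable
    layout: dict = field(default_factory=dict)    # shot-specific variable

@dataclass
class ContinuityGraph:
    assets: dict = field(default_factory=dict)    # shared nodes
    shots: list = field(default_factory=list)     # per-shot nodes

def semantic_spatial_grounding(story_lines):
    """Stage 1: decouple shared assets from shot-specific variables."""
    graph = ContinuityGraph()
    for i, line in enumerate(story_lines):
        graph.shots.append(Shot(index=i, action=line))
    # A real system would have an LLM agent extract recurring entities;
    # this stand-in registers one shared set.
    graph.assets["cafe"] = Asset("cafe", "shared interior set")
    return graph

def canonical_asset_materialization(graph):
    """Stage 2: instantiate each asset once, in a single coordinate frame."""
    return {name: {"pose": (0.0, 0.0, 0.0), "asset": asset}
            for name, asset in graph.assets.items()}

def spatial_temporal_dynamics(graph, scene):
    """Stage 3: pick layout and camera per shot from visual metrics (stubbed)."""
    for shot in graph.shots:
        shot.camera = {"focal_mm": 35, "look_at": "cafe"}
    return graph.shots

graph = semantic_spatial_grounding(
    ["Rick and Ugarte discuss the letters of transit in the bustling cafe."])
scene = canonical_asset_materialization(graph)
shots = spatial_temporal_dynamics(graph, scene)
```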

What carries the argument

The Story-centric Reflection Scheme: a hierarchical agent verification loop that iteratively corrects spatial hallucinations using engine-verified feedback.
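
Read as control flow, the scheme is a render-check-fix loop. A hedged sketch, with `render`, `find_spatial_errors`, and `apply_fix` as stand-ins for the engine renderer, the VLM critic, and the correcting agents (none of these names come from the paper):

```python
def reflect(scene, shots, render, find_spatial_errors, apply_fix, max_turns=5):
    """Iteratively correct spatial errors using engine-rendered feedback."""
    for turn in range(max_turns):
        frames = [render(scene, shot) for shot in shots]   # engine pass
        errors = find_spatial_errors(frames, shots)        # e.g. a VLM critique
        if not errors:
            return scene, turn                             # converged
        for err in errors:
            scene = apply_fix(scene, err)                  # targeted correction
    return scene, max_turns                                # budget exhausted
```

The load-bearing assumption (flagged below) is that `apply_fix` converges, i.e. each correction does not introduce new errors elsewhere.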

If this is right

  • Native 3D output lets users edit cameras and visual assets directly while keeping multi-shot continuity intact (see the editing sketch after this list).
  • The continuity memory graph decouples global assets from shot-specific variables, supporting long sequences without identity drift.
  • Visual-metric-driven layout and cinematic evolution replace manual key-framing for camera and staging decisions.
  • The system improves measurable consistency and editability compared with both 2D diffusion baselines and prior 3D-grounded methods.
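
On the editing point, the decoupling implies a simple invariant: a shot edit touches only shot-local state. A toy illustration under that assumption (dict-based, all names hypothetical):

```python
# Shared 3D assets live in one place; shots only reference them.
scene = {"ugarte": {"pose": (1.0, 0.0, 0.0)}}

shots = [{"action": "cafe wide shot",  "camera": {"focal_mm": 24}},
         {"action": "letters handoff", "camera": {"focal_mm": 35}}]

def edit_camera(shots, idx, camera):
    """Non-destructive edit: replace one shot's camera, nothing else."""
    shots[idx] = {**shots[idx], "camera": dict(camera)}
    return shots

# Mirrors the Figure 5 edit: "Close-up on Ugarte's face" at 56mm f/1.8.
shots = edit_camera(shots, 1, {"focal_mm": 56, "f_stop": 1.8})
assert scene["ugarte"]["pose"] == (1.0, 0.0, 0.0)  # shared identity untouched
```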

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same reflection loop could be reused for other long-horizon 3D generation tasks such as consistent character animation across scenes.
  • Because edits happen inside a single 3D coordinate frame, the approach may support real-time collaborative storyboarding in VR.
  • If the verification loop scales, it could lower the barrier for non-experts to produce professional-grade pre-visualization without learning traditional 3D software.
  • Extending the visual metrics to include narrative pacing or emotional tone would be a direct next step the current pipeline leaves open.

Load-bearing premise

The hierarchical agent loop can reliably detect and fix spatial errors without creating new drift or needing repeated manual fixes.

What would settle it

Generate a five-shot sequence, apply a small camera or asset edit in shot three, then render all shots from the same 3D scene and check whether character appearance or spatial layout remains identical across the sequence.
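
That check is mechanical enough to sketch. Assuming placeholder callables (`render_shot` might be a Blender render, `embed` a CLIP-style image encoder; neither name is from the paper), the edit-and-recompare test looks like:

```python
import numpy as np

def consistency_after_edit(scene, shots, render_shot, embed, edit, edited_idx=2):
    """Render all shots, apply a local edit to one shot, re-render, and score
    how much the untouched shots changed (cosine similarity of embeddings)."""
    before = [embed(render_shot(scene, s)) for s in shots]
    edit(scene, shots[edited_idx])                 # small camera/asset edit
    after = [embed(render_shot(scene, s)) for s in shots]
    sims = {}
    for i, (b, a) in enumerate(zip(before, after)):
        if i == edited_idx:
            continue
        sims[i] = float(np.dot(b, a) / (np.linalg.norm(b) * np.linalg.norm(a)))
    return sims   # values near 1.0 mean the edit stayed local
```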

Figures

Figures reproduced from arXiv: 2604.03315 by Bingliang Li, Daoyi Dong, Hongdong Li, Huadong Mo, Jiaming Bian, Yatao Bian, Yifu Wang, Yuehao Wu, Zhenhong Sun.

Figure 1: Overview of StoryBlender compared to existing storyboarding methods. Left: diffusion-based generation in pixel space; improving consistency typically requires reference inputs. Middle: traditional 3D workflow; strong control but a complex, labor-intensive pipeline. Right (ours): StoryBlender uses a hierarchical multi-agent planning framework to create consistent, editable 3D storyboards across shots. (char… view at source ↗
Figure 2: Hierarchical Multi-Agent Planning Framework. Governed by a Story-centric Reflection Scheme (b), our system utilizes iterative feedback from 3D engines (e.g., Blender) and Vision-Language Models to ensure geometric and narrative consistency. We translate a narrative T_story into 3D storyboards V_3D via a three-stage pipeline: (a) Semantic-Spatial Grounding, where the Director Agent decomposes the story into a s… view at source ↗
Figure 3: Comparison of baselines on a complex multi-shot sequence from the film Casablanca. StoryBlender demonstrates stronger geometric consistency and entity management across shots, maintaining the architectural layout and correct character count in each frame. In contrast, StoryDiffusion and Story2Board capture the general semantic atmosphere but exhibit spatial inconsistencies, hallucinating background changes… view at source ↗
Figure 4: Impact of Story-centric Reflection for the Concept Artist and Visual Effects Artist Agents. To assess the impact of specialized roles, we remove the Visual Effects Artist Agents from the pipeline. We replace the atmospheric lighting and character animations with a default Nishita sky texture [22] and static rest poses, which caused Prompt Alignment scores to drop (see [PIT… view at source ↗
Figure 5: Impact of Story-centric Reflection for the Layout Artist Agents. Compared to a naive reflection approach, 3D physical reflection rapidly minimizes spatial errors across reflective turns. [Panel text from an adjoining editing example: new camera "Close-up on Ugarte's face to emphasize his nervous glances and sweating" (camera added: 56mm F1.8); wall description changed to "Wooden planks wall" (wall textu…] view at source ↗
Figure 6: Non-destructive editing workflow. Starting from an initial generation (left), the figure shows two modes of modification. Middle: agent-assisted edits, where natural language updates camera placement, lighting, and textures without changing unrelated geometry. Right: manual edits in the 3D engine for asset addition/removal, asset transformation, and camera adjustment. This dual-mode design demonstrates pre… view at source ↗
read the original abstract

Storyboarding is a core skill in visual storytelling for film, animation, and games. However, automating this process requires a system to achieve two properties that current approaches rarely satisfy simultaneously: inter-shot consistency and explicit editability. While 2D diffusion-based generators produce vivid imagery, they often suffer from identity drift along with limited geometric control; conversely, traditional 3D animation workflows are consistent and editable but require expert-heavy, labor-intensive authoring. We present StoryBlender, a grounded 3D storyboard generation framework governed by a Story-centric Reflection Scheme. At its core, we propose the StoryBlender system, which is built on a three-stage pipeline: (1) Semantic-Spatial Grounding, to construct a continuity memory graph to decouple global assets from shot-specific variables for long-horizon consistency; (2) Canonical Asset Materialization, to instantiate entities in a unified coordinate space to maintain visual identity; and (3) Spatial-Temporal Dynamics, to achieve layout design and cinematic evolution through visual metrics. By orchestrating multiple agents in a hierarchical manner within a verification loop, StoryBlender iteratively self-corrects spatial hallucinations via engine-verified feedback. The resulting native 3D scenes support direct, precise editing of cameras and visual assets while preserving unwavering multi-shot continuity. Experiments demonstrate that StoryBlender significantly improves consistency and editability over both diffusion-based and 3D-grounded baselines. Code, data, and demonstration video will be available on https://engineeringai-lab.github.io/StoryBlender/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents StoryBlender, a grounded 3D storyboard generation framework that employs a Story-centric Reflection Scheme and a three-stage pipeline—Semantic-Spatial Grounding to build a continuity memory graph, Canonical Asset Materialization to instantiate entities in a unified coordinate space, and Spatial-Temporal Dynamics for layout and cinematic evolution—together with a hierarchical agent verification loop that uses engine-verified feedback to self-correct spatial hallucinations, yielding native 3D scenes with inter-shot consistency and direct editability of cameras and assets.

Significance. If the central claims hold, the work would offer a practical advance in automated visual storytelling by combining the consistency of 3D representations with explicit editability, potentially lowering the labor barrier in film, animation, and game pre-production relative to both diffusion-based and traditional 3D pipelines. The procedural, engine-grounded verification loop is a methodological strength that could generalize to other long-horizon 3D generation tasks.

major comments (2)
  1. [Experiments section] The abstract states that 'Experiments demonstrate that StoryBlender significantly improves consistency and editability over both diffusion-based and 3D-grounded baselines,' yet supplies no quantitative metrics, dataset descriptions, ablation studies, or numerical results; this absence directly undermines the empirical support for the central claim of superiority and must be addressed with concrete tables or figures.
  2. [§3] Three-stage pipeline and verification loop: the claim of 'unwavering multi-shot continuity' rests on the hierarchical agent verification loop successfully correcting spatial hallucinations via engine feedback, but the manuscript reports no iteration counts, correction success rates, or measured drift accumulation across shots; without these quantities the reliability of the loop for long-horizon consistency remains unverified (an instrumentation sketch follows these comments).
minor comments (2)
  1. The abstract promises that 'Code, data, and demonstration video will be available' at a GitHub link, but the link is not expanded and no supplementary material is referenced.
  2. [§2] Notation for the continuity memory graph and unified coordinate space is introduced without an accompanying diagram or formal definition in the early sections, making the pipeline flow harder to follow.
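
The quantities asked for in major comment 2 could be logged with light instrumentation around the loop. A sketch under the assumption that error detection and fixing are exposed as callables (`detect_errors` and `fix_error` are hypothetical names, not the paper's interface):

```python
def instrumented_reflection(shots, detect_errors, fix_error, max_turns=5):
    """Collect per-shot iteration counts, correction success rates, and
    residual error counts (a proxy for drift accumulation)."""
    stats = {}
    for shot in shots:
        attempted = fixed = turns = 0
        errors = detect_errors(shot)
        while errors and turns < max_turns:
            for err in errors:
                attempted += 1
                fixed += int(fix_error(shot, err))  # True if the fix verifies
            turns += 1
            errors = detect_errors(shot)
        stats[shot["index"]] = {
            "turns": turns,
            "success_rate": fixed / attempted if attempted else 1.0,
            "residual_errors": len(errors),
        }
    return stats
```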

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We agree that the current manuscript requires substantial strengthening of the empirical section and additional quantitative details on the verification loop. We will revise the paper to address both points fully.

read point-by-point responses
  1. Referee: [Experiments section] The abstract states that 'Experiments demonstrate that StoryBlender significantly improves consistency and editability over both diffusion-based and 3D-grounded baselines,' yet supplies no quantitative metrics, dataset descriptions, ablation studies, or numerical results; this absence directly undermines the empirical support for the central claim of superiority and must be addressed with concrete tables or figures.

    Authors: We agree that the manuscript as submitted lacks the required quantitative support. In the revised version we will add a complete Experiments section containing: (i) explicit dataset descriptions and splits, (ii) quantitative metrics for inter-shot consistency (e.g., identity preservation, geometric alignment) and editability (e.g., camera and asset manipulation precision), (iii) direct numerical comparisons against the cited diffusion-based and 3D-grounded baselines, and (iv) ablation studies on each stage of the pipeline. These results will be presented in tables and figures. revision: yes

  2. Referee: [§3] Three-stage pipeline and verification loop: the claim of 'unwavering multi-shot continuity' rests on the hierarchical agent verification loop successfully correcting spatial hallucinations via engine feedback, but the manuscript reports no iteration counts, correction success rates, or measured drift accumulation across shots; without these quantities the reliability of the loop for long-horizon consistency remains unverified.

    Authors: We concur that the reliability of the verification loop must be quantified. We will expand §3 with empirical statistics collected during our experiments, including average iteration counts per shot, correction success rates for spatial hallucinations, and measured drift accumulation across multi-shot sequences. These data will be reported in the revised manuscript together with the new experimental tables. revision: yes

Circularity Check

0 steps flagged

No circularity: procedural pipeline with no derivations or self-referential reductions

full rationale

The paper describes StoryBlender as a three-stage procedural pipeline (Semantic-Spatial Grounding, Canonical Asset Materialization, Spatial-Temporal Dynamics) orchestrated by agents in a hierarchical verification loop that uses engine-verified feedback. No equations, fitted parameters, or mathematical derivations appear in the provided text. Claims of inter-shot consistency and editability are presented as outcomes of the architecture and experiments rather than quantities defined by or reduced to the inputs themselves. No self-citation chains, uniqueness theorems, or ansatzes are invoked in a load-bearing way. The system is self-contained as an engineering framework without reducing predictions to fitted inputs or self-definitions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities beyond the named pipeline stages; the continuity memory graph and Story-centric Reflection Scheme are presented as novel system components without independent evidence supplied.

pith-pipeline@v0.9.0 · 5609 in / 1043 out tokens · 43795 ms · 2026-05-13T23:21:57.892977+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

205 extracted references · 205 canonical work pages · 6 internal anchors

  1. [1] Blattmann, A., Dockhorn, T., Kulal, S., Mendelevitch, D., Kilian, M., Lorenz, D., Levi, Y., English, Z., Voleti, V., Letts, A., et al.: Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127 (2023)
  2. [2] Celen, A., Han, G., Schindler, K., Gool, L.V., Armeni, I., Obukhov, A., Wang, X.: I-Design: Personalized LLM interior designer (2024)
  3. [3] Chen, Y., Rao, A., Jiang, X., Xiao, S., Ma, R., Wang, Z., Xiong, H., Dai, B.: CinePreGen: Camera controllable video previsualization via engine-powered diffusion. arXiv preprint arXiv:2408.17424 (2024)
  4. [4] Dinkevich, D., Levy, M., Avrahami, O., Samuel, D., Lischinski, D.: Story2Board: A training-free approach for expressive storyboard generation (2025), https://arxiv.org/abs/2508.09983
  5. [5] Gao, S., Mathew, S., Mi, L., Mamooler, S., Zhao, M., Wakaki, H., Mitsufuji, Y., Montariol, S., Bosselut, A.: ViNaBench: Benchmark for faithful and consistent visual narratives. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2025)
  6. [6] He, H., Yang, H., Tuo, Z., Zhou, Y., Wang, Q., Zhang, Y., Liu, Z., Huang, W., Chao, H., Yin, J.: DreamStory: Open-domain story visualization by LLM-guided multi-subject consistent diffusion. IEEE Transactions on Pattern Analysis and Machine Intelligence 47(12), 11874–11891 (2025). https://doi.org/10.1109/TPAMI.2025.3600149
  7. [7] Hong, X., Sayeed, A., Mehra, K., Demberg, V., Schiele, B.: Visual writing prompts: Character-grounded story generation with curated image sequences. Transactions of the Association for Computational Linguistics 11, 565–581 (2023). https://doi.org/10.1162/tacl_a_00553
  8. [8] Hu, E., Brudy, F., Ledo, D., Fitzmaurice, G., Anderson, F.: PrevizWhiz: Combining rough 3D scenes and 2D video to guide generative video previsualization. arXiv preprint arXiv:2602.03838 (2026)
  9. [9] Hu, Z., Iscen, A., Jain, A., Kipf, T., Yue, Y., Ross, D.A., Schmid, C., Fathi, A.: SceneCraft: An LLM agent for synthesizing 3D scenes as Blender code. In: Forty-first International Conference on Machine Learning (2024)
  10. [10] Huang, L., Wang, W., Wu, Z.F., Shi, Y., Dou, H., Liang, C., Feng, Y., Liu, Y., Zhou, J.: In-context LoRA for diffusion transformers (2024), https://arxiv.org/abs/2410.23775
  11. [11] Huang, Y.H., Wang, W., Huang, S.Y., Wang, Y.C.F.: Toward scene graph and layout guided complex 3D scene generation (2024), https://arxiv.org/abs/2412.20473
  12. [12] Huang, Y., Qin, Y., Lu, S., Wang, X., Huang, R., Shan, Y., Zhang, R.: Story3D-Agent: Exploring 3D storytelling visualization with large language models. arXiv preprint arXiv:2408.11801 (2024)
  13. [13] Lai, Z., Zhao, Y., Liu, H., Zhao, Z., Lin, Q., Shi, H., Yang, X., Yang, M., Yang, S., Feng, Y., et al.: Hunyuan3D 2.5: Towards high-fidelity 3D assets generation with ultimate details. arXiv preprint arXiv:2506.16504 (2025)
  14. [14] Lin, G., Huang, K., Liu, M., Gao, R., Chen, H., Chen, L., Lu, B., Komura, T., Liu, Y., Zhu, J.Y., Li, M.: PAT3D: Physics-augmented text-to-3D scene generation (2025), https://arxiv.org/abs/2511.21978
  15. [15] Lin, H., Zala, A., Cho, J., Bansal, M.: VideoDirectorGPT: Consistent multi-scene video generation via LLM-guided planning. arXiv preprint arXiv:2309.15091 (2023)
  16. [16] Liu, C., Wu, H., Zhong, Y., Zhang, X., Wang, Y., Xie, W.: Intelligent Grimm: Open-ended visual storytelling via latent diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 6190–6200 (June 2024)
  17. [17] Liu, C., Wu, H., Zhong, Y., Zhang, X., Wang, Y., Xie, W.: Intelligent Grimm: Open-ended visual storytelling via latent diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 6190–6200 (2024)
  18. [18] Liu, X., Tang, C.K., Tai, Y.W.: WorldCraft: Photo-realistic 3D world creation and customization via LLM agents (2025), https://arxiv.org/abs/2502.15601
  19. [19] Liu, Y., Zhang, K., Li, Y., Yan, Z., Gao, C., Chen, R., Yuan, Z., Huang, Y., Sun, H., Gao, J., et al.: Sora: A review on background, technology, limitations, and opportunities of large vision models. arXiv preprint arXiv:2402.17177 (2024)
  20. [20] Mao, J., Huang, X., Xie, Y., Chang, Y., Hui, M., Xu, B., Zhou, Y.: Story-Adapter: A training-free iterative framework for long story visualization. arXiv preprint arXiv:2410.06244 (2024)
  21. [21] Mou, C., Wang, X., Xie, L., Wu, Y., Zhang, J., Qi, Z., Shan, Y.: T2I-Adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 38, pp. 4296–4304 (2024)
  22. [22] Nishita, T., Sirai, T., Tadamura, K., Nakamae, E.: Display of the earth taking into account atmospheric scattering. In: Proceedings of the 20th Annual Conference on Computer Graphics and Interactive Techniques. pp. 175–182 (1993)
  23. [23] Öcal, B.M., Tatarchenko, M., Karaoglu, S., Gevers, T.: SceneTeller: Language-to-3D scene generation. arXiv preprint arXiv:2407.20727 (2024)
  24. [24] Ran, X., Li, Y., Xu, L., Yu, M., Dai, B.: Direct numerical layout generation for 3D indoor scene synthesis via spatial reasoning (2025), https://arxiv.org/abs/2506.05341
  25. [25] Rao, A., Jiang, X., Guo, Y., Xu, L., Yang, L., Jin, L., Lin, D., Dai, B.: Dynamic storyboard generation in an engine-based virtual environment for video production. In: ACM SIGGRAPH 2023 Posters, pp. 1–2. Association for Computing Machinery (2023)
  26. [26] Shi, H., Li, Y., Chen, X., Wang, L., Hu, B., Zhang, M.: AniMaker: Multi-agent animated storytelling with MCTS-driven clip generation. In: Proceedings of the SIGGRAPH Asia 2025 Conference Papers. pp. 1–11 (2025)
  27. [27] SongTang, S., Zhao, K., Wang, L., Li, Y., Liu, X., Zou, J., Wang, Q., Chu, X.: UnrealLLM: Towards highly controllable and interactable 3D scene generation by LLM-powered procedural content generation. In: Findings of the Association for Computational Linguistics: ACL 2025. pp. 19417–19435 (2025)
  28. [28] Sun, F.Y., Wu, S., Jacobsen, C., Yim, T., Zou, H., Zook, A., Li, S., Chou, Y.H., Can, E., Wu, X., Eppner, C., Blukis, V., Tremblay, J., Wu, J., Birchfield, S., Haber, N.: 3D-Generalist: Self-improving vision-language-action models for crafting 3D worlds (2025), https://arxiv.org/abs/2507.06484
  29. [29] Team, G., Anil, R., Borgeaud, S., Alayrac, J.B., Yu, J., Soricut, R., Schalkwyk, J., Dai, A.M., Hauth, A., Millican, K., et al.: Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805 (2023)
  30. [30] Tewel, Y., Kaduri, O., Gal, R., Kasten, Y., Wolf, L., Chechik, G., Atzmon, Y.: Training-free consistent text-to-image generation (2024), https://arxiv.org/abs/2402.03286
  31. [31] Wei, Z., Wu, H., Zhang, L., Xu, X., Zheng, Y., Hui, P., Agrawala, M., Qu, H., Rao, A.: CineVision: An interactive pre-visualization storyboard system for director–cinematographer collaboration. In: Proceedings of the 38th Annual ACM Symposium on User Interface Software and Technology. pp. 1–18 (2025)
  32. [32] Wu, C., Li, J., Zhou, J., Lin, J., Gao, K., Yan, K., Yin, S.M., Bai, S., Xu, X., Chen, Y., et al.: Qwen-Image technical report. arXiv preprint arXiv:2508.02324 (2025)
  33. [33] Wu, W., Zhu, Z., Shou, M.Z.: Automated movie generation via multi-agent CoT planning. arXiv preprint arXiv:2503.07314 (2025)
  34. [34] Xiao, S., Liu, Z., Zhang, P., Muennighoff, N.: C-Pack: Packaged resources to advance general Chinese embedding (2023)
  35. [35] Xie, J., Feng, J., Tian, Z., Lin, K.Q., Huang, Y., Xia, X., Gong, N., Zuo, X., Yang, J., Zheng, Y., et al.: Learning long-form video prior via generative pre-training. arXiv preprint arXiv:2404.15909 (2024)
  36. [36] Xie, Z., Tang, D., Tan, D., Klein, J., Bissyand, T.F., Ezzini, S.: DreamFactory: Pioneering multi-scene long video generation with a multi-agent framework. arXiv preprint arXiv:2408.11788 (2024)
  37. [37] Yang, Y., Jia, B., Zhang, S., Huang, S.: SceneWeaver: All-in-one 3D scene synthesis with an extensible and self-reflective agent (2025), https://arxiv.org/abs/2509.20414
  38. [38] Yang, Y., Jia, B., Zhi, P., Huang, S.: PhyScene: Physically interactable 3D scene synthesis for embodied AI. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 16262–16272 (2024)
  39. [39] Yao, K., Zhang, L., Yan, X., Zeng, Y., Zhang, Q., Xu, L., Yang, W., Gu, J., Yu, J.: CAST: Component-aligned 3D scene reconstruction from an RGB image. ACM Transactions on Graphics (TOG) 44(4), 1–19 (2025)
  40. [40] Ye, H., Zhang, J., Liu, S., Han, X., Yang, W.: IP-Adapter: Text compatible image prompt adapter for text-to-image diffusion models. arXiv preprint arXiv:2308.06721 (2023)
  41. [41] Zhang, K., Jiang, L., Wang, A., Fang, J.Z., Zhi, T., Yan, Q., Kang, H., Lu, X., Pan, X.: StoryMem: Multi-shot long video storytelling with memory (2025), https://arxiv.org/abs/2512.19539
  42. [42] Zheng, J., Cun, X.: FairyGen: Storied cartoon video from a single child-drawn character (2025), https://arxiv.org/abs/2506.21272
  43. [43] Zhong, L., Guo, C., Xie, Y., Wang, J., Li, C.: Sketch2Anim: Towards transferring sketch storyboards into 3D animation. ACM Transactions on Graphics (TOG) 44(4), 1–15 (2025)
  44. [44] Zhou, Y., Zhou, D., Cheng, M.M., Feng, J., Hou, Q.: StoryDiffusion: Consistent self-attention for long-range image and video generation. Advances in Neural Information Processing Systems 37, 110315–110340 (2024)
  45. [45] Zhuang, C., Huang, A., Cheng, W., Wu, J., Hu, Y., Liao, J., Wang, H., Liao, X., Cai, W., Xu, H., et al.: ViStoryBench: Comprehensive benchmark suite for story visualization. arXiv preprint arXiv:2505.24862 (2025)

  46–80. Internal anchors rather than citations: per-shot story prompts extracted from the paper's appendix for the evaluation films Casablanca, Good Will Hunting, L.A. Confidential, La La Land, The Truman Show, The Terminal, and Pulp Fiction.
Showing first 80 references.