pith. machine review for the scientific record.

arxiv: 2604.03315 · v1 · submitted 2026-04-01 · 💻 cs.CV · cs.AI

Recognition: 2 Lean theorem links

StoryBlender: Inter-Shot Consistent and Editable 3D Storyboard with Spatial-temporal Dynamics

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 23:21 UTC · model grok-4.3

classification: 💻 cs.CV · cs.AI
keywords: 3D storyboard generation · inter-shot consistency · editable 3D scenes · multi-agent verification · spatial-temporal dynamics · continuity memory graph · story-centric reflection

The pith

StoryBlender builds native 3D storyboards that stay consistent across shots while allowing direct edits to cameras and assets.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents StoryBlender as a framework that turns text or story descriptions into 3D scenes usable for film pre-visualization. It separates global story elements from per-shot details through a continuity memory graph, places every character and object into one shared coordinate system, and then evolves camera and layout choices according to explicit visual metrics. A multi-agent loop checks outputs against a 3D engine and fixes spatial errors before they accumulate. The end result is a single 3D environment that can be tweaked shot by shot without breaking visual identity or continuity. This matters because current 2D generators drift in appearance and traditional 3D tools demand too much manual labor.

Core claim

StoryBlender is a three-stage pipeline governed by a Story-centric Reflection Scheme. Semantic-Spatial Grounding builds a continuity memory graph that decouples shared assets from shot variables. Canonical Asset Materialization places every entity into a single coordinate frame to lock visual identity. Spatial-Temporal Dynamics then uses visual metrics to design layouts and camera moves. A hierarchical agent verification loop feeds engine-rendered checks back into the system, allowing it to self-correct spatial hallucinations while preserving long-horizon multi-shot continuity in the final native 3D scenes.
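
The pipeline's shape can be made concrete with a minimal Python sketch. Everything here is hypothetical scaffolding: Asset, Shot, ContinuityGraph, and the three stage functions are illustrative names, not the paper's API; the sketch only shows how shared assets and per-shot variables would be kept apart.

```python
from dataclasses import dataclass, field

@dataclass
class Asset:
    name: str
    description: str                              # global identity, shared across shots

@dataclass
class Shot:
    index: int
    action: str
    camera: dict = field(default_factory=dict)    # shot-specific variable
    layout: dict = field(default_factory=dict)    # shot-specific variable

@dataclass
class ContinuityGraph:
    assets: dict = field(default_factory=dict)    # shared nodes
    shots: list = field(default_factory=list)     # per-shot nodes

def semantic_spatial_grounding(story_lines):
    """Stage 1: decouple shared assets from shot-specific variables."""
    graph = ContinuityGraph()
    for i, line in enumerate(story_lines):
        graph.shots.append(Shot(index=i, action=line))
    # A real system would have an LLM agent extract recurring entities;
    # this stand-in registers one shared set.
    graph.assets["cafe"] = Asset("cafe", "shared interior set")
    return graph

def canonical_asset_materialization(graph):
    """Stage 2: instantiate each asset once, in a single coordinate frame."""
    return {name: {"pose": (0.0, 0.0, 0.0), "asset": asset}
            for name, asset in graph.assets.items()}

def spatial_temporal_dynamics(graph, scene):
    """Stage 3: pick layout and camera per shot from visual metrics (stubbed)."""
    for shot in graph.shots:
        shot.camera = {"focal_mm": 35, "look_at": "cafe"}
    return graph.shots

graph = semantic_spatial_grounding(
    ["Rick and Ugarte discuss the letters of transit in the bustling cafe."])
scene = canonical_asset_materialization(graph)
shots = spatial_temporal_dynamics(graph, scene)
```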

What carries the argument

The Story-centric Reflection Scheme: a hierarchical agent verification loop that iteratively corrects spatial hallucinations using engine-verified feedback.
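
Read as control flow, the scheme is a render-check-fix loop. A hedged sketch, with `render`, `find_spatial_errors`, and `apply_fix` as stand-ins for the engine renderer, the VLM critic, and the correcting agents (none of these names come from the paper):

```python
def reflect(scene, shots, render, find_spatial_errors, apply_fix, max_turns=5):
    """Iteratively correct spatial errors using engine-rendered feedback."""
    for turn in range(max_turns):
        frames = [render(scene, shot) for shot in shots]   # engine pass
        errors = find_spatial_errors(frames, shots)        # e.g. a VLM critique
        if not errors:
            return scene, turn                             # converged
        for err in errors:
            scene = apply_fix(scene, err)                  # targeted correction
    return scene, max_turns                                # budget exhausted
```

The load-bearing assumption (flagged below) is that `apply_fix` converges, i.e. each correction does not introduce new errors elsewhere.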

If this is right

  • Native 3D output lets users edit cameras and visual assets directly while keeping multi-shot continuity intact (see the editing sketch after this list).
  • The continuity memory graph decouples global assets from shot-specific variables, supporting long sequences without identity drift.
  • Visual-metric-driven layout and cinematic evolution replace manual key-framing for camera and staging decisions.
  • The system improves measurable consistency and editability compared with both 2D diffusion baselines and prior 3D-grounded methods.
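
On the editing point, the decoupling implies a simple invariant: a shot edit touches only shot-local state. A toy illustration under that assumption (dict-based, all names hypothetical):

```python
# Shared 3D assets live in one place; shots only reference them.
scene = {"ugarte": {"pose": (1.0, 0.0, 0.0)}}

shots = [{"action": "cafe wide shot",  "camera": {"focal_mm": 24}},
         {"action": "letters handoff", "camera": {"focal_mm": 35}}]

def edit_camera(shots, idx, camera):
    """Non-destructive edit: replace one shot's camera, nothing else."""
    shots[idx] = {**shots[idx], "camera": dict(camera)}
    return shots

# Mirrors the Figure 5 edit: "Close-up on Ugarte's face" at 56mm f/1.8.
shots = edit_camera(shots, 1, {"focal_mm": 56, "f_stop": 1.8})
assert scene["ugarte"]["pose"] == (1.0, 0.0, 0.0)  # shared identity untouched
```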

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same reflection loop could be reused for other long-horizon 3D generation tasks such as consistent character animation across scenes.
  • Because edits happen inside a single 3D coordinate frame, the approach may support real-time collaborative storyboarding in VR.
  • If the verification loop scales, it could lower the barrier for non-experts to produce professional-grade pre-visualization without learning traditional 3D software.
  • Extending the visual metrics to include narrative pacing or emotional tone would be a direct next step the current pipeline leaves open.

Load-bearing premise

The hierarchical agent loop can reliably detect and fix spatial errors without creating new drift or needing repeated manual fixes.

What would settle it

Generate a five-shot sequence, apply a small camera or asset edit in shot three, then render all shots from the same 3D scene and check whether character appearance or spatial layout remains identical across the sequence.
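
That check is mechanical enough to sketch. Assuming placeholder callables (`render_shot` might be a Blender render, `embed` a CLIP-style image encoder; neither name is from the paper), the edit-and-recompare test looks like:

```python
import numpy as np

def consistency_after_edit(scene, shots, render_shot, embed, edit, edited_idx=2):
    """Render all shots, apply a local edit to one shot, re-render, and score
    how much the untouched shots changed (cosine similarity of embeddings)."""
    before = [embed(render_shot(scene, s)) for s in shots]
    edit(scene, shots[edited_idx])                 # small camera/asset edit
    after = [embed(render_shot(scene, s)) for s in shots]
    sims = {}
    for i, (b, a) in enumerate(zip(before, after)):
        if i == edited_idx:
            continue
        sims[i] = float(np.dot(b, a) / (np.linalg.norm(b) * np.linalg.norm(a)))
    return sims   # values near 1.0 mean the edit stayed local
```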

Figures

Figures reproduced from arXiv: 2604.03315 by Bingliang Li, Daoyi Dong, Hongdong Li, Huadong Mo, Jiaming Bian, Yatao Bian, Yifu Wang, Yuehao Wu, Zhenhong Sun.

Figure 1: Overview of StoryBlender compared to existing storyboarding methods. Left: diffusion-based generation in pixel space; improving consistency typically requires reference inputs. Middle: traditional 3D workflow; strong control but a complex, labor-intensive pipeline. Right (ours): StoryBlender uses a hierarchical multi-agent planning framework to create consistent, editable 3D storyboards across shots. (char… view at source ↗
Figure 2: Hierarchical Multi-Agent Planning Framework. Governed by a Story-centric Reflection Scheme (b), our system utilizes iterative feedback from 3D engines (e.g., Blender) and Vision-Language Models to ensure geometric and narrative consistency. We translate a narrative T_story into 3D storyboards V_3D via a three-stage pipeline: (a) Semantic-Spatial Grounding, where the Director Agent decomposes the story into a s… view at source ↗
Figure 3: Comparison of baselines on a complex multi-shot sequence from the film Casablanca. StoryBlender demonstrates stronger geometric consistency and entity management across shots, maintaining the architectural layout and correct character count in each frame. In contrast, StoryDiffusion and Story2Board capture the general semantic atmosphere but exhibit spatial inconsistencies, hallucinating background changes… view at source ↗
Figure 4: Impact of Story-centric Reflection for the Concept Artist and Visual Effects Artist Agents. To assess the impact of specialized roles, we remove the Visual Effects Artist Agents from the pipeline. We replace the atmospheric lighting and character animations with a default Nishita sky texture [22] and static rest poses, which caused Prompt Alignment scores to drop (see [PIT… view at source ↗
Figure 5: Impact of Story-centric Reflection for the Layout Artist Agents. Compared to a naive reflection approach, 3D physical reflection rapidly minimizes spatial errors across reflective turns. [Panel text from an adjoining editing example: new camera "Close-up on Ugarte's face to emphasize his nervous glances and sweating" (camera added: 56mm F1.8); wall description changed to "Wooden planks wall" (wall textu…] view at source ↗
Figure 6: Non-destructive editing workflow. Starting from an initial generation (left), the figure shows two modes of modification. Middle: agent-assisted edits, where natural language updates camera placement, lighting, and textures without changing unrelated geometry. Right: manual edits in the 3D engine for asset addition/removal, asset transformation, and camera adjustment. This dual-mode design demonstrates pre… view at source ↗
read the original abstract

Storyboarding is a core skill in visual storytelling for film, animation, and games. However, automating this process requires a system to achieve two properties that current approaches rarely satisfy simultaneously: inter-shot consistency and explicit editability. While 2D diffusion-based generators produce vivid imagery, they often suffer from identity drift along with limited geometric control; conversely, traditional 3D animation workflows are consistent and editable but require expert-heavy, labor-intensive authoring. We present StoryBlender, a grounded 3D storyboard generation framework governed by a Story-centric Reflection Scheme. At its core, we propose the StoryBlender system, which is built on a three-stage pipeline: (1) Semantic-Spatial Grounding, to construct a continuity memory graph to decouple global assets from shot-specific variables for long-horizon consistency; (2) Canonical Asset Materialization, to instantiate entities in a unified coordinate space to maintain visual identity; and (3) Spatial-Temporal Dynamics, to achieve layout design and cinematic evolution through visual metrics. By orchestrating multiple agents in a hierarchical manner within a verification loop, StoryBlender iteratively self-corrects spatial hallucinations via engine-verified feedback. The resulting native 3D scenes support direct, precise editing of cameras and visual assets while preserving unwavering multi-shot continuity. Experiments demonstrate that StoryBlender significantly improves consistency and editability over both diffusion-based and 3D-grounded baselines. Code, data, and demonstration video will be available on https://engineeringai-lab.github.io/StoryBlender/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents StoryBlender, a grounded 3D storyboard generation framework that employs a Story-centric Reflection Scheme and a three-stage pipeline—Semantic-Spatial Grounding to build a continuity memory graph, Canonical Asset Materialization to instantiate entities in a unified coordinate space, and Spatial-Temporal Dynamics for layout and cinematic evolution—together with a hierarchical agent verification loop that uses engine-verified feedback to self-correct spatial hallucinations, yielding native 3D scenes with inter-shot consistency and direct editability of cameras and assets.

Significance. If the central claims hold, the work would offer a practical advance in automated visual storytelling by combining the consistency of 3D representations with explicit editability, potentially lowering the labor barrier in film, animation, and game pre-production relative to both diffusion-based and traditional 3D pipelines. The procedural, engine-grounded verification loop is a methodological strength that could generalize to other long-horizon 3D generation tasks.

major comments (2)
  1. [Experiments section] The abstract states that 'Experiments demonstrate that StoryBlender significantly improves consistency and editability over both diffusion-based and 3D-grounded baselines,' yet supplies no quantitative metrics, dataset descriptions, ablation studies, or numerical results; this absence directly undermines the empirical support for the central claim of superiority and must be addressed with concrete tables or figures.
  2. [§3] Three-stage pipeline and verification loop: the claim of 'unwavering multi-shot continuity' rests on the hierarchical agent verification loop successfully correcting spatial hallucinations via engine feedback, but the manuscript reports no iteration counts, correction success rates, or measured drift accumulation across shots; without these quantities the reliability of the loop for long-horizon consistency remains unverified (an instrumentation sketch follows these comments).
minor comments (2)
  1. The abstract promises that 'Code, data, and demonstration video will be available' at a GitHub link, but the link is not expanded and no supplementary material is referenced.
  2. [§2] Notation for the continuity memory graph and unified coordinate space is introduced without an accompanying diagram or formal definition in the early sections, making the pipeline flow harder to follow.
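
The quantities asked for in major comment 2 could be logged with light instrumentation around the loop. A sketch under the assumption that error detection and fixing are exposed as callables (`detect_errors` and `fix_error` are hypothetical names, not the paper's interface):

```python
def instrumented_reflection(shots, detect_errors, fix_error, max_turns=5):
    """Collect per-shot iteration counts, correction success rates, and
    residual error counts (a proxy for drift accumulation)."""
    stats = {}
    for shot in shots:
        attempted = fixed = turns = 0
        errors = detect_errors(shot)
        while errors and turns < max_turns:
            for err in errors:
                attempted += 1
                fixed += int(fix_error(shot, err))  # True if the fix verifies
            turns += 1
            errors = detect_errors(shot)
        stats[shot["index"]] = {
            "turns": turns,
            "success_rate": fixed / attempted if attempted else 1.0,
            "residual_errors": len(errors),
        }
    return stats
```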

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We agree that the current manuscript requires substantial strengthening of the empirical section and additional quantitative details on the verification loop. We will revise the paper to address both points fully.

read point-by-point responses
  1. Referee: [Experiments section] The abstract states that 'Experiments demonstrate that StoryBlender significantly improves consistency and editability over both diffusion-based and 3D-grounded baselines,' yet supplies no quantitative metrics, dataset descriptions, ablation studies, or numerical results; this absence directly undermines the empirical support for the central claim of superiority and must be addressed with concrete tables or figures.

    Authors: We agree that the manuscript as submitted lacks the required quantitative support. In the revised version we will add a complete Experiments section containing: (i) explicit dataset descriptions and splits, (ii) quantitative metrics for inter-shot consistency (e.g., identity preservation, geometric alignment) and editability (e.g., camera and asset manipulation precision), (iii) direct numerical comparisons against the cited diffusion-based and 3D-grounded baselines, and (iv) ablation studies on each stage of the pipeline. These results will be presented in tables and figures. revision: yes

  2. Referee: [§3] Three-stage pipeline and verification loop: the claim of 'unwavering multi-shot continuity' rests on the hierarchical agent verification loop successfully correcting spatial hallucinations via engine feedback, but the manuscript reports no iteration counts, correction success rates, or measured drift accumulation across shots; without these quantities the reliability of the loop for long-horizon consistency remains unverified.

    Authors: We concur that the reliability of the verification loop must be quantified. We will expand §3 with empirical statistics collected during our experiments, including average iteration counts per shot, correction success rates for spatial hallucinations, and measured drift accumulation across multi-shot sequences. These data will be reported in the revised manuscript together with the new experimental tables. revision: yes

Circularity Check

0 steps flagged

No circularity: procedural pipeline with no derivations or self-referential reductions

full rationale

The paper describes StoryBlender as a three-stage procedural pipeline (Semantic-Spatial Grounding, Canonical Asset Materialization, Spatial-Temporal Dynamics) orchestrated by agents in a hierarchical verification loop that uses engine-verified feedback. No equations, fitted parameters, or mathematical derivations appear in the provided text. Claims of inter-shot consistency and editability are presented as outcomes of the architecture and experiments rather than quantities defined by or reduced to the inputs themselves. No self-citation chains, uniqueness theorems, or ansatzes are invoked in a load-bearing way. The system is self-contained as an engineering framework without reducing predictions to fitted inputs or self-definitions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities beyond the named pipeline stages; the continuity memory graph and Story-centric Reflection Scheme are presented as novel system components without independent evidence supplied.

pith-pipeline@v0.9.0 · 5609 in / 1043 out tokens · 43795 ms · 2026-05-13T23:21:57.892977+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

205 extracted references · 205 canonical work pages · 6 internal anchors

  1. [1] Blattmann, A., Dockhorn, T., Kulal, S., Mendelevitch, D., Kilian, M., Lorenz, D., Levi, Y., English, Z., Voleti, V., Letts, A., et al.: Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127 (2023)
  2. [2] Celen, A., Han, G., Schindler, K., Gool, L.V., Armeni, I., Obukhov, A., Wang, X.: I-Design: Personalized LLM interior designer (2024)
  3. [3] Chen, Y., Rao, A., Jiang, X., Xiao, S., Ma, R., Wang, Z., Xiong, H., Dai, B.: CinePreGen: Camera controllable video previsualization via engine-powered diffusion. arXiv preprint arXiv:2408.17424 (2024)
  4. [4] Dinkevich, D., Levy, M., Avrahami, O., Samuel, D., Lischinski, D.: Story2Board: A training-free approach for expressive storyboard generation (2025), https://arxiv.org/abs/2508.09983
  5. [5] Gao, S., Mathew, S., Mi, L., Mamooler, S., Zhao, M., Wakaki, H., Mitsufuji, Y., Montariol, S., Bosselut, A.: ViNaBench: Benchmark for faithful and consistent visual narratives. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2025)
  6. [6] He, H., Yang, H., Tuo, Z., Zhou, Y., Wang, Q., Zhang, Y., Liu, Z., Huang, W., Chao, H., Yin, J.: DreamStory: Open-domain story visualization by LLM-guided multi-subject consistent diffusion. IEEE Transactions on Pattern Analysis and Machine Intelligence 47(12), 11874–11891 (2025). https://doi.org/10.1109/TPAMI.2025.3600149
  7. [7] Hong, X., Sayeed, A., Mehra, K., Demberg, V., Schiele, B.: Visual writing prompts: Character-grounded story generation with curated image sequences. Transactions of the Association for Computational Linguistics 11, 565–581 (2023). https://doi.org/10.1162/tacl_a_00553
  8. [8] Hu, E., Brudy, F., Ledo, D., Fitzmaurice, G., Anderson, F.: PrevizWhiz: Combining rough 3D scenes and 2D video to guide generative video previsualization. arXiv preprint arXiv:2602.03838 (2026)
  9. [9] Hu, Z., Iscen, A., Jain, A., Kipf, T., Yue, Y., Ross, D.A., Schmid, C., Fathi, A.: SceneCraft: An LLM agent for synthesizing 3D scenes as Blender code. In: Forty-first International Conference on Machine Learning (2024)
  10. [10] Huang, L., Wang, W., Wu, Z.F., Shi, Y., Dou, H., Liang, C., Feng, Y., Liu, Y., Zhou, J.: In-context LoRA for diffusion transformers (2024), https://arxiv.org/abs/2410.23775
  11. [11] Huang, Y.H., Wang, W., Huang, S.Y., Wang, Y.C.F.: Toward scene graph and layout guided complex 3D scene generation (2024), https://arxiv.org/abs/2412.20473
  12. [12] Huang, Y., Qin, Y., Lu, S., Wang, X., Huang, R., Shan, Y., Zhang, R.: Story3D-Agent: Exploring 3D storytelling visualization with large language models. arXiv preprint arXiv:2408.11801 (2024)
  13. [13] Lai, Z., Zhao, Y., Liu, H., Zhao, Z., Lin, Q., Shi, H., Yang, X., Yang, M., Yang, S., Feng, Y., et al.: Hunyuan3D 2.5: Towards high-fidelity 3D assets generation with ultimate details. arXiv preprint arXiv:2506.16504 (2025)
  14. [14] Lin, G., Huang, K., Liu, M., Gao, R., Chen, H., Chen, L., Lu, B., Komura, T., Liu, Y., Zhu, J.Y., Li, M.: PAT3D: Physics-augmented text-to-3D scene generation (2025), https://arxiv.org/abs/2511.21978
  15. [15] Lin, H., Zala, A., Cho, J., Bansal, M.: VideoDirectorGPT: Consistent multi-scene video generation via LLM-guided planning. arXiv preprint arXiv:2309.15091 (2023)
  16. [16] Liu, C., Wu, H., Zhong, Y., Zhang, X., Wang, Y., Xie, W.: Intelligent Grimm: Open-ended visual storytelling via latent diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 6190–6200 (June 2024)
  17. [17] Liu, C., Wu, H., Zhong, Y., Zhang, X., Wang, Y., Xie, W.: Intelligent Grimm: Open-ended visual storytelling via latent diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 6190–6200 (2024)
  18. [18] Liu, X., Tang, C.K., Tai, Y.W.: WorldCraft: Photo-realistic 3D world creation and customization via LLM agents (2025), https://arxiv.org/abs/2502.15601
  19. [19] Liu, Y., Zhang, K., Li, Y., Yan, Z., Gao, C., Chen, R., Yuan, Z., Huang, Y., Sun, H., Gao, J., et al.: Sora: A review on background, technology, limitations, and opportunities of large vision models. arXiv preprint arXiv:2402.17177 (2024)
  20. [20] Mao, J., Huang, X., Xie, Y., Chang, Y., Hui, M., Xu, B., Zhou, Y.: Story-Adapter: A training-free iterative framework for long story visualization. arXiv preprint arXiv:2410.06244 (2024)
  21. [21] Mou, C., Wang, X., Xie, L., Wu, Y., Zhang, J., Qi, Z., Shan, Y.: T2I-Adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 38, pp. 4296–4304 (2024)
  22. [22] Nishita, T., Sirai, T., Tadamura, K., Nakamae, E.: Display of the earth taking into account atmospheric scattering. In: Proceedings of the 20th Annual Conference on Computer Graphics and Interactive Techniques. pp. 175–182 (1993)
  23. [23] Öcal, B.M., Tatarchenko, M., Karaoglu, S., Gevers, T.: SceneTeller: Language-to-3D scene generation. arXiv preprint arXiv:2407.20727 (2024)
  24. [24] Ran, X., Li, Y., Xu, L., Yu, M., Dai, B.: Direct numerical layout generation for 3D indoor scene synthesis via spatial reasoning (2025), https://arxiv.org/abs/2506.05341
  25. [25] Rao, A., Jiang, X., Guo, Y., Xu, L., Yang, L., Jin, L., Lin, D., Dai, B.: Dynamic storyboard generation in an engine-based virtual environment for video production. In: ACM SIGGRAPH 2023 Posters, pp. 1–2. Association for Computing Machinery (2023)
  26. [26] Shi, H., Li, Y., Chen, X., Wang, L., Hu, B., Zhang, M.: AniMaker: Multi-agent animated storytelling with MCTS-driven clip generation. In: Proceedings of the SIGGRAPH Asia 2025 Conference Papers. pp. 1–11 (2025)
  27. [27] SongTang, S., Zhao, K., Wang, L., Li, Y., Liu, X., Zou, J., Wang, Q., Chu, X.: UnrealLLM: Towards highly controllable and interactable 3D scene generation by LLM-powered procedural content generation. In: Findings of the Association for Computational Linguistics: ACL 2025. pp. 19417–19435 (2025)
  28. [28] Sun, F.Y., Wu, S., Jacobsen, C., Yim, T., Zou, H., Zook, A., Li, S., Chou, Y.H., Can, E., Wu, X., Eppner, C., Blukis, V., Tremblay, J., Wu, J., Birchfield, S., Haber, N.: 3D-Generalist: Self-improving vision-language-action models for crafting 3D worlds (2025), https://arxiv.org/abs/2507.06484
  29. [29] Team, G., Anil, R., Borgeaud, S., Alayrac, J.B., Yu, J., Soricut, R., Schalkwyk, J., Dai, A.M., Hauth, A., Millican, K., et al.: Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805 (2023)
  30. [30] Tewel, Y., Kaduri, O., Gal, R., Kasten, Y., Wolf, L., Chechik, G., Atzmon, Y.: Training-free consistent text-to-image generation (2024), https://arxiv.org/abs/2402.03286
  31. [31] Wei, Z., Wu, H., Zhang, L., Xu, X., Zheng, Y., Hui, P., Agrawala, M., Qu, H., Rao, A.: CineVision: An interactive pre-visualization storyboard system for director–cinematographer collaboration. In: Proceedings of the 38th Annual ACM Symposium on User Interface Software and Technology. pp. 1–18 (2025)
  32. [32] Wu, C., Li, J., Zhou, J., Lin, J., Gao, K., Yan, K., Yin, S.M., Bai, S., Xu, X., Chen, Y., et al.: Qwen-Image technical report. arXiv preprint arXiv:2508.02324 (2025)
  33. [33] Wu, W., Zhu, Z., Shou, M.Z.: Automated movie generation via multi-agent CoT planning. arXiv preprint arXiv:2503.07314 (2025)
  34. [34] Xiao, S., Liu, Z., Zhang, P., Muennighoff, N.: C-Pack: Packaged resources to advance general Chinese embedding (2023)
  35. [35] Xie, J., Feng, J., Tian, Z., Lin, K.Q., Huang, Y., Xia, X., Gong, N., Zuo, X., Yang, J., Zheng, Y., et al.: Learning long-form video prior via generative pre-training. arXiv preprint arXiv:2404.15909 (2024)
  36. [36] Xie, Z., Tang, D., Tan, D., Klein, J., Bissyand, T.F., Ezzini, S.: DreamFactory: Pioneering multi-scene long video generation with a multi-agent framework. arXiv preprint arXiv:2408.11788 (2024)
  37. [37] Yang, Y., Jia, B., Zhang, S., Huang, S.: SceneWeaver: All-in-one 3D scene synthesis with an extensible and self-reflective agent (2025), https://arxiv.org/abs/2509.20414
  38. [38] Yang, Y., Jia, B., Zhi, P., Huang, S.: PhyScene: Physically interactable 3D scene synthesis for embodied AI. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 16262–16272 (2024)
  39. [39] Yao, K., Zhang, L., Yan, X., Zeng, Y., Zhang, Q., Xu, L., Yang, W., Gu, J., Yu, J.: CAST: Component-aligned 3D scene reconstruction from an RGB image. ACM Transactions on Graphics (TOG) 44(4), 1–19 (2025)
  40. [40] Ye, H., Zhang, J., Liu, S., Han, X., Yang, W.: IP-Adapter: Text compatible image prompt adapter for text-to-image diffusion models. arXiv preprint arXiv:2308.06721 (2023)
  41. [41] Zhang, K., Jiang, L., Wang, A., Fang, J.Z., Zhi, T., Yan, Q., Kang, H., Lu, X., Pan, X.: StoryMem: Multi-shot long video storytelling with memory (2025), https://arxiv.org/abs/2512.19539
  42. [42] Zheng, J., Cun, X.: FairyGen: Storied cartoon video from a single child-drawn character (2025), https://arxiv.org/abs/2506.21272
  43. [43] Zhong, L., Guo, C., Xie, Y., Wang, J., Li, C.: Sketch2Anim: Towards transferring sketch storyboards into 3D animation. ACM Transactions on Graphics (TOG) 44(4), 1–15 (2025)
  44. [44] Zhou, Y., Zhou, D., Cheng, M.M., Feng, J., Hou, Q.: StoryDiffusion: Consistent self-attention for long-range image and video generation. Advances in Neural Information Processing Systems 37, 110315–110340 (2024)
  45. [45] Zhuang, C., Huang, A., Cheng, W., Wu, J., Hu, Y., Liao, J., Wang, H., Liao, X., Cai, W., Xu, H., et al.: ViStoryBench: Comprehensive benchmark suite for story visualization. arXiv preprint arXiv:2505.24862 (2025)

  46–80. Internal anchors rather than citations: per-shot story prompts extracted from the paper's appendix for the evaluation films Casablanca, Good Will Hunting, L.A. Confidential, La La Land, The Truman Show, The Terminal, and Pulp Fiction.
Showing first 80 references.