pith. machine review for the scientific record.

arxiv: 2605.11927 · v1 · submitted 2026-05-12 · 💻 cs.CV

Recognition: 2 theorem links


RealDiffusion: Physics-informed Attention for Multi-character Storybook Generation

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 06:20 UTC · model grok-4.3

classification 💻 cs.CV
keywords diffusion models · storybook generation · character coherence · physics-informed attention · multi-character sequences · narrative dynamism · self-attention

The pith

RealDiffusion models attention features as heat diffusion plus stochastic perturbations to keep characters consistent while letting stories evolve.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to fix a core problem in diffusion-based image generation: when creating sequences for stories with multiple characters, models either let characters drift in appearance or freeze the action into repetitive scenes. It proposes treating the way features evolve inside self-attention layers as a simple physical system. Heat diffusion acts as a smoothing force that averages nearby features and removes high-frequency identity noise, while a light stochastic process adds small random shifts to keep the sequence from collapsing into static poses. The whole mechanism runs at inference time with no extra training, simply by altering the attention computation. If the modeling works, sequential story generation could produce longer, more reliable multi-character narratives without the usual trade-off between consistency and change.

Core claim

RealDiffusion reconciles robust character coherence with narrative dynamism by injecting a configurable physical prior into self-attention: heat diffusion averages neighboring features to suppress attribute drift and stabilize identities across frames, while a region-aware stochastic process introduces controlled perturbations that allow pose changes and scene evolution to continue.

What carries the argument

Physics-informed Attention, a training-free modification that treats feature evolution in self-attention layers as a heat-diffusion dissipative prior combined with stochastic perturbations.
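To make the mechanism concrete, here is a minimal sketch of the idea on a toy 2-D feature map: a discrete heat-diffusion step (a 5-point Laplacian) that averages neighboring features, plus a small Gaussian perturbation. The function name, parameter values, and periodic boundary handling are illustrative assumptions, not the paper's actual attention-layer implementation.

```python
import numpy as np

def physics_informed_update(features, alpha=0.1, sigma=0.02, steps=5, rng=None):
    """Toy sketch (hypothetical, not the paper's code): smooth a 2-D feature
    map with discrete heat diffusion, then add small stochastic perturbations.
    alpha is the diffusion coefficient, sigma the noise scale."""
    rng = np.random.default_rng(0) if rng is None else rng
    f = features.astype(float).copy()
    for _ in range(steps):
        # 5-point discrete Laplacian with periodic (wrap-around) boundaries:
        # the dissipative term that averages each feature toward its neighbors
        lap = (np.roll(f, 1, 0) + np.roll(f, -1, 0)
               + np.roll(f, 1, 1) + np.roll(f, -1, 1) - 4 * f)
        f += alpha * lap                             # heat diffusion: suppresses drift
        f += sigma * rng.standard_normal(f.shape)    # stochastic term: explores nearby modes
    return f
```

With `sigma=0` the update is purely dissipative (variance shrinks, mean is preserved); raising `sigma` trades smoothness back for variation, mirroring the coherence-dynamism dial the paper describes.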

If this is right

  • Suppresses attribute drift inside subject regions and stabilizes identity across sequential frames.
  • Prevents story collapse by allowing small pose and scene changes through stochastic perturbations.
  • Regularizes spatio-temporal relationships in attention without suppressing intentional prompt-driven variations.
  • Delivers measurable gains in character coherence on multi-character storybook tasks while matching or exceeding prior methods on dynamism.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same diffusion-plus-stochastic prior could be tested on other sequential generation tasks such as comic panels or short video clips.
  • Physical analogies of this kind offer a way to add controllable regularization to generative models without retraining or large datasets.
  • Longer story sequences might benefit if the heat-diffusion scale is made adaptive to sequence length rather than fixed.

Load-bearing premise

That casting attention features as heat diffusion plus small stochastic noise will reduce unwanted identity drift without blocking prompt-driven story changes or creating new visual artifacts.

What would settle it

Compare generated story sequences with and without the physics-informed attention on the same prompts and measure whether character attributes stay more consistent across frames while narrative elements such as poses and scene actions continue to vary.
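One simple way to operationalize the coherence half of that test is mean pairwise cosine similarity over per-frame character embeddings. The function below is an illustrative metric sketch, not the paper's evaluation protocol; the embeddings themselves would come from some identity encoder chosen by the evaluator.

```python
import numpy as np

def frame_coherence(embeddings):
    """Mean pairwise cosine similarity across per-frame character embeddings.
    Higher values suggest more consistent identity across frames.
    Illustrative metric only; the paper's actual protocol may differ."""
    e = np.asarray(embeddings, dtype=float)
    e = e / np.linalg.norm(e, axis=1, keepdims=True)   # unit-normalize each frame
    sims = e @ e.T                                     # cosine similarity matrix
    iu = np.triu_indices(len(e), k=1)                  # distinct frame pairs only
    return float(sims[iu].mean())
```

Running this with and without the physics-informed attention on the same prompts, alongside a dynamism measure (e.g., pose variation across frames), would expose whether coherence gains come at the cost of a frozen story.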

Figures

Figures reproduced from arXiv: 2605.11927 by Guang Dai, Ivor Tsang, Jun Chen, Qi Zhao.

Figure 1: Our framework tunes the balance between storytelling dynamism (left column) and robust character coherence (right column).
Figure 2: Overview of RealDiffusion. Our framework incorporates Physics-informed Attention into a U-Net, governed by …
Figure 3: An 8-frame story sequence with its dynamic masks. Prompt: …
Figure 4: A qualitative comparison of our RealDiffusion against five state-of-the-art baseline models.
Figure 5: Ablation on physical priors. A high level of character coherence can be readily achieved by using heat diffusion as the coherence …
Figure 6: Effect of α on the trade-off between coherence and dynamism. Prompt: Storybook watercolor, girl and bear, flying a kite, sharing honey, sitting on the moon, catching stars, sliding down a rainbow, and napping on a cloud.
Figure 7: Quantitative impact of the controller α on metrics.
read the original abstract

While modern diffusion models excel at generating diverse single images, extending this to sequential generation reveals a fundamental challenge: balancing narrative dynamism with multi-character coherence. Existing methods often falter at this trade-off, leading to artifacts where characters lose their identity or the story stagnates. To resolve this critical tension, we introduce RealDiffusion, a unified framework designed to reconcile robust coherence with narrative dynamism. Heat diffusion serves as a dissipative prior that averages neighboring features along the sequence and removes high-frequency noise within the subject region. This suppresses attribute drift and stabilizes identity across frames. A region-aware stochastic process then introduces small perturbations that explore nearby modes and prevent collapse so the story maintains pose change and scene evolution. We thus introduce a lightweight, training-free Physics-informed Attention mechanism that injects controllable physical priors into the self-attention layers during inference. By modeling feature evolution as a configurable physical system, our method regularizes spatio-temporal relationships without suppressing intentional, prompt-driven changes. Extensive experiments demonstrate that RealDiffusion achieves substantial gains in character coherence while preserving narrative dynamism, outperforming state-of-the-art approaches. Code is available at https://github.com/ShmilyQi-CN/RealDiffusion.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces RealDiffusion, a training-free framework that injects a Physics-informed Attention mechanism into the self-attention layers of diffusion models during inference for multi-character storybook generation. Heat diffusion acts as a dissipative prior that averages neighboring features to suppress attribute drift and stabilize identity, while a region-aware stochastic process adds small perturbations to explore modes and preserve prompt-driven pose and scene changes. The central claim is that this configurable physical system regularizes spatio-temporal relationships to achieve substantial gains in character coherence while maintaining narrative dynamism, outperforming state-of-the-art methods.

Significance. If the claimed balance holds under quantitative scrutiny, the work would demonstrate a lightweight, inference-only route for embedding external physical priors into generative models, offering a practical solution to the coherence-dynamism trade-off in sequential image synthesis without retraining. This could influence downstream applications in consistent character animation and story visualization.

major comments (2)
  1. [Abstract] The assertion of 'substantial gains in character coherence' and 'outperforming state-of-the-art approaches' is unsupported by any metrics, baselines, ablation results, or quantitative tables, which directly undermines verification of the central claim.
  2. [Method] Method description (physics-informed attention): no explicit equations, diffusion coefficients, noise schedules, or parameter values are supplied for the heat-diffusion term or stochastic perturbations, leaving the mechanism for balancing drift suppression against prompt-driven evolution unverified and load-bearing for the reported trade-off.
minor comments (1)
  1. [Abstract] Abstract: the GitHub link for code availability is a positive step toward reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and have revised the manuscript to improve clarity and verifiability of our claims and method.

read point-by-point responses
  1. Referee: [Abstract] The assertion of 'substantial gains in character coherence' and 'outperforming state-of-the-art approaches' is unsupported by any metrics, baselines, ablation results, or quantitative tables, which directly undermines verification of the central claim.

    Authors: We agree that the abstract would be stronger with explicit quantitative support. The full manuscript contains detailed experiments (Section 4) with metrics for character coherence (e.g., identity preservation scores), narrative dynamism measures, baselines, and ablation tables demonstrating improvements over SOTA methods. We will revise the abstract to include key numerical results, such as average coherence gains and comparisons, to make the central claims immediately verifiable. revision: yes

  2. Referee: [Method] Method description (physics-informed attention): no explicit equations, diffusion coefficients, noise schedules, or parameter values are supplied for the heat-diffusion term or stochastic perturbations, leaving the mechanism for balancing drift suppression against prompt-driven evolution unverified and load-bearing for the reported trade-off.

    Authors: We appreciate this observation. The method section provides a conceptual description of the physics-informed attention, but we acknowledge the need for explicit formulations. We will add the governing equations for the heat diffusion term (including the diffusion coefficient and discretization scheme), the region-aware stochastic process with its noise schedule, and the specific hyperparameter values used in experiments to fully specify the balance between coherence and dynamism. revision: yes
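For readers wanting a sense of what such governing equations might look like, one plausible discretization of the mechanism is sketched below. This is illustrative only and not taken from the paper: F^k denotes attention features at step k, α the diffusion coefficient, σ the noise scale, and M a region-aware mask, all of which the authors would need to specify.

```latex
% Illustrative discretization (not the paper's actual equations):
% heat-diffusion term plus a region-masked stochastic perturbation.
F^{k+1} = F^{k} + \alpha\,\Delta t\,\nabla^{2} F^{k}
        + \sigma\sqrt{\Delta t}\,\bigl(M \odot \xi^{k}\bigr),
\qquad \xi^{k} \sim \mathcal{N}(0, I)
```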

Circularity Check

0 steps flagged

No circularity: physical priors introduced as external mechanism without reduction to inputs

full rationale

The paper presents RealDiffusion as a training-free inference-time injection of heat diffusion (as dissipative averaging) plus stochastic perturbations into self-attention layers. No equations, parameter fits, or derivations are shown that reduce the claimed coherence-dynamism balance to a self-defined quantity, fitted input renamed as prediction, or self-citation chain. The physical system is motivated as an external configurable prior rather than derived from the target result itself, and the abstract provides no self-referential steps that would force the outcome by construction. This is the common case of an independent modeling choice whose validity rests on empirical verification rather than definitional equivalence.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the unverified assumption that heat diffusion and stochastic perturbations can be directly translated into attention-layer operations to control spatio-temporal feature evolution; no free parameters, additional axioms, or invented entities are explicitly listed in the abstract.

axioms (2)
  • domain assumption Heat diffusion acts as a dissipative prior that averages neighboring features and removes high-frequency noise within subject regions.
    Invoked to justify suppression of attribute drift and identity stabilization.
  • domain assumption A region-aware stochastic process can introduce perturbations that explore nearby modes without causing collapse.
    Invoked to maintain pose change and scene evolution.

pith-pipeline@v0.9.0 · 5504 in / 1369 out tokens · 153337 ms · 2026-05-13T06:20:54.297944+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

40 extracted references · 40 canonical work pages · 9 internal anchors

  1. Andreas Blattmann, Robin Rombach, Huan Ling, Tim Dockhorn, Seung Wook Kim, Sanja Fidler, and Karsten Kreis. Align your latents: High-resolution video synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22563–22575, 2023.
  2. Salvatore Cuomo, Vincenzo Schiano Di Cola, Fabio Giampaolo, Gianluigi Rozza, Maziar Raissi, and Francesco Piccialli. Scientific machine learning through physics-informed neural networks: Where we are and what's next. Journal of Scientific Computing, 92(3):88, 2022.
  3. Shaan Desai, Marios Mattheakis, Hayden Joy, Pavlos Protopapas, and Stephen Roberts. One-shot transfer learning of physics-informed neural networks. arXiv preprint arXiv:2110.11286, 2021.
  4. Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In Forty-first International Conference on Machine Learning.
  5. Stephanie Fu, Netanel Tamir, Shobhita Sundaram, Lucy Chai, Richard Zhang, Tali Dekel, and Phillip Isola. DreamSim: Learning new dimensions of human visual similarity using synthetic data. arXiv preprint arXiv:2306.09344.
  6. Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit H Bermano, Gal Chechik, and Daniel Cohen-Or. An image is worth one word: Personalizing text-to-image generation using textual inversion. arXiv preprint arXiv:2208.01618, 2022.
  7. Yuan Gong, Youxin Pang, Xiaodong Cun, Menghan Xia, Yingqing He, Haoxin Chen, Longyue Wang, Yong Zhang, Xintao Wang, Ying Shan, et al. TaleCrafter: Interactive story visualization with multiple characters. arXiv preprint arXiv:2305.18247, 2023.
  8. Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. CLIPScore: A reference-free evaluation metric for image captioning. arXiv preprint arXiv:2104.08718.
  9. Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022.
  10. Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.
  11. Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P Kingma, Ben Poole, Mohammad Norouzi, David J Fleet, et al. Imagen Video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303, 2022.
  12. Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. LoRA: Low-rank adaptation of large language models. ICLR, 1(2):3, 2022.
  13. Nupur Kumari, Bingliang Zhang, Richard Zhang, Eli Shechtman, and Jun-Yan Zhu. Multi-concept customization of text-to-image diffusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1931–1941, 2023.
  14. Mingxiao Li, Mang Ning, and Marie-Francine Moens. Consistent story generation: Unlocking the potential of zigzag sampling. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025.
  15. Zhen Li, Mingdeng Cao, Xintao Wang, Zhongang Qi, Ming-Ming Cheng, and Ying Shan. PhotoMaker: Customizing realistic human photos via stacked ID embedding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8640–8650, 2024.
  16. Chen-Hsuan Lin, Jun Gao, Luming Tang, Towaki Takikawa, Xiaohui Zeng, Xun Huang, Karsten Kreis, Sanja Fidler, Ming-Yu Liu, and Tsung-Yi Lin. Magic3D: High-resolution text-to-3D content creation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 300–309, 2023.
  17. Tao Liu, Kai Wang, Senmao Li, Joost van de Weijer, Fahad Shahbaz Khan, Shiqi Yang, Yaxing Wang, Jian Yang, and Ming-Ming Cheng. One-prompt-one-story: Free-lunch consistent text-to-image generation using a single prompt. arXiv preprint arXiv:2501.13554, 2025.
  18. OpenAI. GPT-5 system card. https://openai.com/index/gpt-5-system-card/, 2025. Accessed 2025-11-12.
  19. Nobuyuki Otsu. A threshold selection method from gray-level histograms. IEEE Transactions on Systems, Man, and Cybernetics, 9(1):62–66, 1979.
  20. William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4195–4205.
  21. Ben Poole, Ajay Jain, Jonathan T Barron, and Ben Mildenhall. DreamFusion: Text-to-3D using 2D diffusion. arXiv preprint arXiv:2209.14988, 2022.
  22. Apostolos F Psaros, Kenji Kawaguchi, and George Em Karniadakis. Meta-learning PINN loss functions. Journal of Computational Physics, 458:111121, 2022.
  23. Maziar Raissi, Paris Perdikaris, and George E Karniadakis. Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. Journal of Computational Physics, 378:686–707, 2019.
  24. Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022.
  25. Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 234–241. Springer, 2015.
  26. Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. DreamBooth: Fine tuning text-to-image diffusion models for subject-driven generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22500–22510, 2023.
  27. Simo Ryu. Low-rank adaptation for fast text-to-image diffusion fine-tuning, 3, 2023.
  28. Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems, 35:36479–36494, 2022.
  29. Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang Zhang, Qiyuan Hu, Harry Yang, Oron Ashual, Oran Gafni, et al. Make-A-Video: Text-to-video generation without text-video data. arXiv preprint arXiv:2209.14792.
  30. Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pages 2256–2265. PMLR, 2015.
  31. Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020.
  32. Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456, 2020.
  33. Yoad Tewel, Omri Kaduri, Rinon Gal, Yoni Kasten, Lior Wolf, Gal Chechik, and Yuval Atzmon. Training-free consistent text-to-image generation. ACM Transactions on Graphics (TOG), 43(4):1–18, 2024.
  34. Edgar Torres, Jonathan Schiefer, and Mathias Niepert. Adaptive physics-informed neural networks: A survey. arXiv preprint arXiv:2503.18181, 2025.
  35. Mengyu Wang, Henghui Ding, Jianing Peng, Yao Zhao, Yunpeng Chen, and Yunchao Wei. CharaConsist: Fine-grained consistent character generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 16058–16067, 2025.
  36. Qixun Wang, Xu Bai, Haofan Wang, Zekui Qin, Anthony Chen, Huaxia Li, Xu Tang, and Yao Hu. InstantID: Zero-shot identity-preserving generation in seconds. arXiv preprint arXiv:2401.07519, 2024.
  37. Guangxuan Xiao, Tianwei Yin, William T Freeman, Frédo Durand, and Song Han. FastComposer: Tuning-free multi-subject image generation with localized attention. International Journal of Computer Vision, 133(3):1175–1194, 2025.
  38. Hu Ye, Jun Zhang, Sibo Liu, Xiao Han, and Wei Yang. IP-Adapter: Text compatible image prompt adapter for text-to-image diffusion models. arXiv preprint arXiv:2308.06721.
  39. Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3836–3847, 2023.
  40. Yupeng Zhou, Daquan Zhou, Ming-Ming Cheng, Jiashi Feng, and Qibin Hou. StoryDiffusion: Consistent self-attention for long-range image and video generation. Advances in Neural Information Processing Systems, 37:110315–110340, 2024.