pith. machine review for the scientific record.

arxiv: 2604.04335 · v2 · submitted 2026-04-06 · 💻 cs.DC

Recognition: 2 theorem links


GENSERVE: Efficient Co-Serving of Heterogeneous Diffusion Model Workloads

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 20:17 UTC · model grok-4.3

classification 💻 cs.DC
keywords: diffusion models · co-serving · SLO attainment · GPU clusters · text-to-image · text-to-video · preemption · resource scheduling

The pith

By exploiting the preemptible steps in diffusion inference, a co-serving system improves SLO attainment for mixed text-to-image and text-to-video workloads by up to 44 percent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that current serving systems incur many SLO violations when text-to-image and text-to-video diffusion requests run together on shared GPUs, because the two modalities differ sharply in compute cost, parallelism, and latency targets. It shows that diffusion inference consists of discrete, predictable steps and can be stopped and restarted at the boundaries between them. GENSERVE therefore applies three linked mechanisms: intelligent preemption of video requests, elastic sequence parallelism combined with dynamic batching, and an SLO-aware scheduler that reallocates resources across all active jobs. These changes let the system interleave the workloads more efficiently than prior approaches. A reader would care because production platforms increasingly need to host both modalities on the same hardware without building separate clusters or missing latency targets.

Core claim

Diffusion inference advances through a fixed sequence of denoising steps that remain interruptible at each boundary. GENSERVE uses this property to coordinate three mechanisms: it preempts longer video generations at step boundaries when shorter image requests need resources, it adjusts sequence parallelism and batch sizes on the fly to match current workload mix, and it runs an SLO-aware scheduler that jointly decides allocation for every concurrent request. The result is a measurable rise in the fraction of requests that meet their latency targets, reaching gains of up to 44 percent over the strongest baseline across tested configurations.

What carries the argument

Step-level resource adaptation, which treats each diffusion denoising step as a natural preemption point and uses that granularity to drive intelligent video preemption, elastic sequence parallelism with dynamic batching, and joint SLO-aware scheduling across heterogeneous requests.

If this is right

  • Shared GPU clusters can host both text-to-image and text-to-video requests without dedicated partitions for each modality.
  • The fraction of requests meeting latency SLOs rises by up to 44 percent relative to prior co-serving baselines.
  • Resource allocation across concurrent requests becomes jointly optimized rather than handled independently per modality.
  • Dynamic adjustments to parallelism and batch size can track changes in the mix of short and long requests in real time.
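A toy version of the elastic allocation policy behind the last bullet might look like the function below. The thresholds, the one-GPU floor for images, and the 8-images-per-GPU batching factor are invented for the sketch, not taken from the paper.

```python
# Illustrative policy: map the current request mix to a sequence-parallel
# degree for videos and a batch size for images. All constants are assumed.
def choose_allocation(n_images: int, n_videos: int, free_gpus: int):
    """Return (video_parallel_degree, image_batch_size)."""
    if n_videos == 0:
        return 0, min(n_images, free_gpus * 8)   # all GPUs batch images
    # Reserve a slice of GPUs for images, give the rest to videos.
    gpus_for_images = min(free_gpus - 1, max(1, n_images // 4))
    degree = max(1, (free_gpus - gpus_for_images) // n_videos)
    batch = min(n_images, gpus_for_images * 8)
    return degree, batch
```

With 8 free GPUs, 4 queued images, and 2 videos, this sketch reserves one GPU for image batching and splits the remaining seven across the videos (degree 3 each), re-deriving the split whenever the mix changes.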

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same step-boundary preemption idea could extend to other iterative generative models that also advance in fixed, interruptible stages.
  • Cluster operators might reduce total GPU count needed to support both modalities at a given SLO target.
  • Further scheduling policies could incorporate additional signals such as remaining step count or user priority once the basic preemptibility is established.

Load-bearing premise

The diffusion process can be stopped and resumed at step boundaries with no hidden overheads or quality penalties that would cancel the gains from the three mechanisms.
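Under that premise, a step-boundary checkpoint only needs to carry the latents, the step index, and the sampler's RNG state; in particular, no per-token KV cache accumulates across steps as it does in LLM serving. The dictionary below is a toy model of that claim, with field names assumed rather than taken from the paper.

```python
import copy

def checkpoint(job_state: dict) -> dict:
    """Snapshot the minimal resumable state: latents, step index, RNG state."""
    return {
        "latents": copy.deepcopy(job_state["latents"]),
        "step": job_state["step"],
        "rng_state": job_state["rng_state"],
    }

def resume(ckpt: dict) -> dict:
    # Restore is just a copy in this toy model; on a real GPU it would be a
    # host-to-device transfer whose cost is exactly what needs measuring.
    return dict(ckpt)

state = {"latents": [0.1, 0.2], "step": 17, "rng_state": 1234}
saved = checkpoint(state)
state["latents"][0] = 9.9          # mutate the live job after the snapshot
restored = resume(saved)
```

If saving and restoring this state were not cheap relative to a denoising step, the premise fails and the gains from preemption would be eaten by transfer overhead.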

What would settle it

Measure SLO attainment rates on a mixed T2I and T2V workload while forcing all preemption to occur only at full generation completion instead of at step boundaries; if the 44 percent gain vanishes, the central claim does not hold.
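The proposed ablation reduces to comparing SLO attainment rates (SAR) under two preemption policies on the same trace. Every number below is fabricated purely to show the computation; only the metric definition is standard.

```python
# SAR = fraction of requests whose end-to-end latency meets its deadline.
def slo_attainment(latencies, deadlines):
    met = sum(1 for lat, d in zip(latencies, deadlines) if lat <= d)
    return met / len(latencies)

# Hypothetical per-request latencies (seconds) for one trace of 3 images
# (tight 2 s SLOs) and 2 videos (loose 60 s SLOs) under each policy.
deadlines       = [2, 2, 2, 60, 60]
step_boundary   = [1.5, 1.8, 1.9, 55, 58]   # images slip in between video steps
completion_only = [30, 31, 1.9, 55, 58]     # images blocked behind whole videos

gain = slo_attainment(step_boundary, deadlines) - \
       slo_attainment(completion_only, deadlines)
```

If the gap between the two policies collapses on real traces, the step-boundary mechanism is not doing the work the paper attributes to it.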

Figures

Figures reproduced from arXiv: 2604.04335 by Arvind Krishnamurthy, Desen Sun, Ethan Ma, Fanjiang Ye, Jingwei Zuo, Kaijian Wang, Myungjin Lee, Russell Chen, Triston Cao, Xinrui Zhong, Ye Cao, Yuke Wang, Zhangke Li.

Figure 1. Serving 4 videos (V1–V4) and 3 images (I1–I3) on …
Figure 2. Overview of the DiT inference process.
Figure 3. End-to-end latency of T2I and T2V workloads.
Figure 4. Head-of-line blocking under FCFS scheduling with …
Figure 5. The runtime of different stages in T2V across different …
Figure 6. Communication overhead (as % of per-step time) …
Figure 7. System overview of GENSERVE. Incoming requests are admitted by the Scheduler, which maintains cluster states and queries the Profiler for offline latency estimates to invoke the Solver for joint optimization. The resulting decisions are passed to the Allocator and the Controller, which dispatches tasks and preemption signals to Workers. Workers execute inference on assigned GPUs and report step-level progress.
Figure 8. Illustration of intelligent video preemption. …
Figure 9. Adaptive resource allocation on the cluster serv…
Figure 10. Number of requests meeting SLO versus SLO …
Figure 12. Overall SAR (%) versus arrival rate (12–…
Figure 13. CDF of per-request turnaround latency at the de…
Figure 14. Ablation study under the skewed resolution setting.
Original abstract

Diffusion models have emerged as the prevailing approach for text-to-image (T2I) and text-to-video (T2V) generation, yet production platforms must increasingly serve both modalities on shared GPU clusters while meeting stringent latency SLOs. Co-serving such heterogeneous workloads is challenging: T2I and T2V requests exhibit vastly different compute demands, parallelism characteristics, and latency requirements, leading to significant SLO violations in existing serving systems. We present GENSERVE, a co-serving system that leverages the inherent predictability of the diffusion process to optimize serving efficiency. A central insight is that diffusion inference proceeds in discrete, predictable steps and is naturally preemptible at step boundaries, opening a new design space for heterogeneity-aware resource management. GENSERVE introduces step-level resource adaptation through three coordinated mechanisms: intelligent video preemption, elastic sequence parallelism with dynamic batching, and an SLO-aware scheduler that jointly optimizes resource allocation across all concurrent requests. Experimental results show that GENSERVE improves the SLO attainment rate by up to 44% over the strongest baseline across diverse configurations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript presents GENSERVE, a co-serving system for heterogeneous diffusion model workloads on shared GPU clusters, focusing on text-to-image (T2I) and text-to-video (T2V) generation. It exploits the discrete and naturally preemptible steps of diffusion inference to enable step-level resource adaptation via three mechanisms: intelligent video preemption, elastic sequence parallelism with dynamic batching, and an SLO-aware scheduler that jointly optimizes allocations. The central experimental claim is an improvement of up to 44% in SLO attainment rate over the strongest baseline across diverse configurations.

Significance. If the performance claims hold under detailed scrutiny, the work would represent a practical contribution to systems for serving generative AI models. Handling mixed T2I/T2V workloads with strict latency SLOs is a timely problem as production platforms scale these models together; the step-boundary preemption insight and coordinated mechanisms offer a concrete design that could inform future heterogeneity-aware serving frameworks. The paper ships a systems artifact with measured outcomes rather than purely theoretical analysis, which strengthens its utility.

major comments (2)
  1. Abstract: the claim of up to 44% SLO improvement is presented without any information on the baselines compared against, the workload traces or request mixes used, the hardware platform, or statistical significance of the results. This absence directly undermines assessment of whether the central performance claim is load-bearing or reproducible.
  2. Description of video preemption mechanism: the assumption that diffusion steps are low-overhead preemption points is load-bearing for the 44% gain, yet no measurements are provided of state-save/restore costs (activation tensors, KV-cache, noise state) for T2V requests, which have substantially larger memory footprints than T2I. Without bounding these costs under mixed workloads, it remains unclear whether the scheduler's joint optimization delivers net positive SLO gains or if bandwidth and context-switch overheads erode them.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive review and for identifying areas where additional context and quantification would strengthen the manuscript. We address both major comments below and will incorporate revisions to improve clarity and completeness of the performance claims.

Point-by-point responses
  1. Referee: Abstract: the claim of up to 44% SLO improvement is presented without any information on the baselines compared against, the workload traces or request mixes used, the hardware platform, or statistical significance of the results. This absence directly undermines assessment of whether the central performance claim is load-bearing or reproducible.

    Authors: We agree that the abstract would benefit from additional context to make the central claim more transparent. In the revised manuscript we will expand the abstract to briefly specify the baselines (static allocation and prior diffusion serving systems), the workload traces (synthetic and real heterogeneous T2I/T2V mixes with varying request ratios and arrival rates), the hardware platform (NVIDIA A100 GPU cluster), and note that the 44% figure represents the maximum improvement observed across configurations with results averaged over multiple runs. revision: yes

  2. Referee: Description of video preemption mechanism: the assumption that diffusion steps are low-overhead preemption points is load-bearing for the 44% gain, yet no measurements are provided of state-save/restore costs (activation tensors, KV-cache, noise state) for T2V requests, which have substantially larger memory footprints than T2I. Without bounding these costs under mixed workloads, it remains unclear whether the scheduler's joint optimization delivers net positive SLO gains or if bandwidth and context-switch overheads erode them.

    Authors: This is a valid point; the net benefit of step-level preemption depends on the overhead being small relative to per-step compute. The current manuscript does not include quantitative microbenchmarks of save/restore latency or bandwidth for T2V state under mixed workloads. We will add these measurements in the revised version (new subsection in Evaluation or Appendix), reporting the time to checkpoint and restore noise tensors plus activations for representative T2V sequence lengths, both in isolation and when co-located with T2I requests. We will then show that the overhead remains a small fraction of step execution time and does not erode the reported SLO gains, or adjust the claims and scheduler if the data indicate otherwise. revision: yes
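The promised microbenchmark could start as small as this. The latent size, the 2-second nominal per-step time, and the host-side byte copy standing in for a GPU checkpoint are all assumptions made for the sketch, not measurements from the paper.

```python
import time

def time_copy(n_floats: int, trials: int = 5) -> float:
    """Best-of-N wall time to copy an fp32 buffer of n_floats elements,
    a crude stand-in for checkpointing a latent tensor."""
    buf = bytearray(n_floats * 4)
    best = float("inf")
    for _ in range(trials):
        t0 = time.perf_counter()
        _ = bytes(buf)                    # checkpoint = one full copy
        best = min(best, time.perf_counter() - t0)
    return best

save_s = time_copy(16 * 1024 * 1024)      # ~64 MB of latents (assumed size)
step_s = 2.0                              # nominal per-step compute time (assumed)
overhead_pct = 100 * save_s / step_s
# The rebuttal's claim holds only if overhead_pct stays a small fraction
# of per-step time under realistic T2V state sizes and PCIe contention.
```

A real version would time device-to-host and host-to-device transfers of the actual T2V state with co-located T2I traffic on the same link, which is where contention could erode the gains.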

Circularity Check

0 steps flagged

No circularity; empirical systems paper with measured results

Full rationale

The paper presents GENSERVE as a systems artifact for co-serving T2I and T2V diffusion workloads. Its core claim (up to 44% SLO improvement) rests on experimental measurements of three mechanisms (video preemption, elastic sequence parallelism with dynamic batching, and SLO-aware scheduler) rather than any derivation, equation, or fitted parameter. The abstract and description state the preemptibility insight as a design premise but do not reduce any prediction or result to that premise by construction. No self-citations, uniqueness theorems, or ansatzes appear as load-bearing steps. The work is self-contained against external benchmarks via reported outcomes; no step matches the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that diffusion steps are discrete and preemptible; no free parameters or invented entities are stated in the abstract.

axioms (1)
  • domain assumption: Diffusion inference proceeds in discrete, predictable steps and is naturally preemptible at step boundaries.
    Stated as the central insight enabling the three mechanisms.

pith-pipeline@v0.9.0 · 5525 in / 1224 out tokens · 61298 ms · 2026-05-10T20:17:43.936461+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

46 extracted references · 24 canonical work pages · 14 internal anchors
