pith. machine review for the scientific record.

arxiv: 2604.04335 · v2 · submitted 2026-04-06 · 💻 cs.DC

Recognition: 2 theorem links


GENSERVE: Efficient Co-Serving of Heterogeneous Diffusion Model Workloads

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 20:17 UTC · model grok-4.3

classification 💻 cs.DC
keywords: diffusion models · co-serving · SLO attainment · GPU clusters · text-to-image · text-to-video · preemption · resource scheduling

The pith

By exploiting the preemptible steps in diffusion inference, a co-serving system improves SLO attainment for mixed text-to-image and text-to-video workloads by up to 44 percent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that current serving systems incur many SLO violations when text-to-image and text-to-video diffusion requests run together on shared GPUs, because the two modalities differ sharply in compute cost, parallelism, and latency targets. It shows that diffusion inference consists of discrete, predictable steps and can be stopped and restarted at the boundaries between them. GENSERVE therefore applies three linked mechanisms: intelligent preemption of video requests, elastic sequence parallelism combined with dynamic batching, and an SLO-aware scheduler that reallocates resources across all active jobs. These changes let the system interleave the workloads more efficiently than prior approaches. A reader would care because production platforms increasingly need to host both modalities on the same hardware without building separate clusters or missing latency targets.

Core claim

Diffusion inference advances through a fixed sequence of denoising steps that remain interruptible at each boundary. GENSERVE uses this property to coordinate three mechanisms: it preempts longer video generations at step boundaries when shorter image requests need resources, it adjusts sequence parallelism and batch sizes on the fly to match current workload mix, and it runs an SLO-aware scheduler that jointly decides allocation for every concurrent request. The result is a measurable rise in the fraction of requests that meet their latency targets, reaching gains of up to 44 percent over the strongest baseline across tested configurations.

What carries the argument

Step-level resource adaptation, which treats each diffusion denoising step as a natural preemption point and uses that granularity to drive intelligent video preemption, elastic sequence parallelism with dynamic batching, and joint SLO-aware scheduling across heterogeneous requests.

If this is right

  • Shared GPU clusters can host both text-to-image and text-to-video requests without dedicated partitions for each modality.
  • The fraction of requests meeting latency SLOs rises by up to 44 percent relative to prior co-serving baselines.
  • Resource allocation across concurrent requests becomes jointly optimized rather than handled independently per modality.
  • Dynamic adjustments to parallelism and batch size can track changes in the mix of short and long requests in real time.
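A toy version of the elastic allocation policy behind the last bullet might look like the function below. The thresholds, the one-GPU floor for images, and the 8-images-per-GPU batching factor are invented for the sketch, not taken from the paper.

```python
# Illustrative policy: map the current request mix to a sequence-parallel
# degree for videos and a batch size for images. All constants are assumed.
def choose_allocation(n_images: int, n_videos: int, free_gpus: int):
    """Return (video_parallel_degree, image_batch_size)."""
    if n_videos == 0:
        return 0, min(n_images, free_gpus * 8)   # all GPUs batch images
    # Reserve a slice of GPUs for images, give the rest to videos.
    gpus_for_images = min(free_gpus - 1, max(1, n_images // 4))
    degree = max(1, (free_gpus - gpus_for_images) // n_videos)
    batch = min(n_images, gpus_for_images * 8)
    return degree, batch
```

With 8 free GPUs, 4 queued images, and 2 videos, this sketch reserves one GPU for image batching and splits the remaining seven across the videos (degree 3 each), re-deriving the split whenever the mix changes.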

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same step-boundary preemption idea could extend to other iterative generative models that also advance in fixed, interruptible stages.
  • Cluster operators might reduce total GPU count needed to support both modalities at a given SLO target.
  • Further scheduling policies could incorporate additional signals such as remaining step count or user priority once the basic preemptibility is established.

Load-bearing premise

The diffusion process can be stopped and resumed at step boundaries with no hidden overheads or quality penalties that would cancel the gains from the three mechanisms.
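Under that premise, a step-boundary checkpoint only needs to carry the latents, the step index, and the sampler's RNG state; in particular, no per-token KV cache accumulates across steps as it does in LLM serving. The dictionary below is a toy model of that claim, with field names assumed rather than taken from the paper.

```python
import copy

def checkpoint(job_state: dict) -> dict:
    """Snapshot the minimal resumable state: latents, step index, RNG state."""
    return {
        "latents": copy.deepcopy(job_state["latents"]),
        "step": job_state["step"],
        "rng_state": job_state["rng_state"],
    }

def resume(ckpt: dict) -> dict:
    # Restore is just a copy in this toy model; on a real GPU it would be a
    # host-to-device transfer whose cost is exactly what needs measuring.
    return dict(ckpt)

state = {"latents": [0.1, 0.2], "step": 17, "rng_state": 1234}
saved = checkpoint(state)
state["latents"][0] = 9.9          # mutate the live job after the snapshot
restored = resume(saved)
```

If saving and restoring this state were not cheap relative to a denoising step, the premise fails and the gains from preemption would be eaten by transfer overhead.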

What would settle it

Measure SLO attainment rates on a mixed T2I and T2V workload while forcing all preemption to occur only at full generation completion instead of at step boundaries; if the 44 percent gain vanishes, the central claim does not hold.
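The proposed ablation reduces to comparing SLO attainment rates (SAR) under two preemption policies on the same trace. Every number below is fabricated purely to show the computation; only the metric definition is standard.

```python
# SAR = fraction of requests whose end-to-end latency meets its deadline.
def slo_attainment(latencies, deadlines):
    met = sum(1 for lat, d in zip(latencies, deadlines) if lat <= d)
    return met / len(latencies)

# Hypothetical per-request latencies (seconds) for one trace of 3 images
# (tight 2 s SLOs) and 2 videos (loose 60 s SLOs) under each policy.
deadlines       = [2, 2, 2, 60, 60]
step_boundary   = [1.5, 1.8, 1.9, 55, 58]   # images slip in between video steps
completion_only = [30, 31, 1.9, 55, 58]     # images blocked behind whole videos

gain = slo_attainment(step_boundary, deadlines) - \
       slo_attainment(completion_only, deadlines)
```

If the gap between the two policies collapses on real traces, the step-boundary mechanism is not doing the work the paper attributes to it.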

Figures

Figures reproduced from arXiv: 2604.04335 by Arvind Krishnamurthy, Desen Sun, Ethan Ma, Fanjiang Ye, Jingwei Zuo, Kaijian Wang, Myungjin Lee, Russell Chen, Triston Cao, Xinrui Zhong, Ye Cao, Yuke Wang, Zhangke Li.

Figure 1. Serving 4 videos (V1–V4) and 3 images (I1–I3) on …
Figure 2. Overview of the DiT inference process.
Figure 3. End-to-end latency of T2I and T2V workloads.
Figure 4. Head-of-line blocking under FCFS scheduling with …
Figure 5. The runtime of different stages in T2V across different …
Figure 6. Communication overhead (as % of per-step time) …
Figure 7. System overview of GENSERVE. Incoming requests are admitted by the Scheduler, which maintains cluster states and queries the Profiler for offline latency estimates to invoke the Solver for joint optimization. The resulting decisions are passed to the Allocator and the Controller, which dispatches tasks and preemption signals to Workers. Workers execute inference on assigned GPUs and report step-level progress.
Figure 8. Illustration of intelligent video preemption. …
Figure 9. Adaptive resource allocation on the cluster serv…
Figure 10. Number of requests meeting SLO versus SLO …
Figure 12. Overall SAR (%) versus arrival rate (12–…
Figure 13. CDF of per-request turnaround latency at the de…
Figure 14. Ablation study under the skewed resolution setting.
Original abstract

Diffusion models have emerged as the prevailing approach for text-to-image (T2I) and text-to-video (T2V) generation, yet production platforms must increasingly serve both modalities on shared GPU clusters while meeting stringent latency SLOs. Co-serving such heterogeneous workloads is challenging: T2I and T2V requests exhibit vastly different compute demands, parallelism characteristics, and latency requirements, leading to significant SLO violations in existing serving systems. We present GENSERVE, a co-serving system that leverages the inherent predictability of the diffusion process to optimize serving efficiency. A central insight is that diffusion inference proceeds in discrete, predictable steps and is naturally preemptible at step boundaries, opening a new design space for heterogeneity-aware resource management. GENSERVE introduces step-level resource adaptation through three coordinated mechanisms: intelligent video preemption, elastic sequence parallelism with dynamic batching, and an SLO-aware scheduler that jointly optimizes resource allocation across all concurrent requests. Experimental results show that GENSERVE improves the SLO attainment rate by up to 44% over the strongest baseline across diverse configurations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript presents GENSERVE, a co-serving system for heterogeneous diffusion model workloads on shared GPU clusters, focusing on text-to-image (T2I) and text-to-video (T2V) generation. It exploits the discrete and naturally preemptible steps of diffusion inference to enable step-level resource adaptation via three mechanisms: intelligent video preemption, elastic sequence parallelism with dynamic batching, and an SLO-aware scheduler that jointly optimizes allocations. The central experimental claim is an improvement of up to 44% in SLO attainment rate over the strongest baseline across diverse configurations.

Significance. If the performance claims hold under detailed scrutiny, the work would represent a practical contribution to systems for serving generative AI models. Handling mixed T2I/T2V workloads with strict latency SLOs is a timely problem as production platforms scale these models together; the step-boundary preemption insight and coordinated mechanisms offer a concrete design that could inform future heterogeneity-aware serving frameworks. The paper ships a systems artifact with measured outcomes rather than purely theoretical analysis, which strengthens its utility.

major comments (2)
  1. Abstract: the claim of up to 44% SLO improvement is presented without any information on the baselines compared against, the workload traces or request mixes used, the hardware platform, or statistical significance of the results. This absence directly undermines assessment of whether the central performance claim is load-bearing or reproducible.
  2. Description of video preemption mechanism: the assumption that diffusion steps are low-overhead preemption points is load-bearing for the 44% gain, yet no measurements are provided of state-save/restore costs (activation tensors, KV-cache, noise state) for T2V requests, which have substantially larger memory footprints than T2I. Without bounding these costs under mixed workloads, it remains unclear whether the scheduler's joint optimization delivers net positive SLO gains or if bandwidth and context-switch overheads erode them.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive review and for identifying areas where additional context and quantification would strengthen the manuscript. We address both major comments below and will incorporate revisions to improve clarity and completeness of the performance claims.

Point-by-point responses
  1. Referee: Abstract: the claim of up to 44% SLO improvement is presented without any information on the baselines compared against, the workload traces or request mixes used, the hardware platform, or statistical significance of the results. This absence directly undermines assessment of whether the central performance claim is load-bearing or reproducible.

    Authors: We agree that the abstract would benefit from additional context to make the central claim more transparent. In the revised manuscript we will expand the abstract to briefly specify the baselines (static allocation and prior diffusion serving systems), the workload traces (synthetic and real heterogeneous T2I/T2V mixes with varying request ratios and arrival rates), the hardware platform (NVIDIA A100 GPU cluster), and note that the 44% figure represents the maximum improvement observed across configurations with results averaged over multiple runs. revision: yes

  2. Referee: Description of video preemption mechanism: the assumption that diffusion steps are low-overhead preemption points is load-bearing for the 44% gain, yet no measurements are provided of state-save/restore costs (activation tensors, KV-cache, noise state) for T2V requests, which have substantially larger memory footprints than T2I. Without bounding these costs under mixed workloads, it remains unclear whether the scheduler's joint optimization delivers net positive SLO gains or if bandwidth and context-switch overheads erode them.

    Authors: This is a valid point; the net benefit of step-level preemption depends on the overhead being small relative to per-step compute. The current manuscript does not include quantitative microbenchmarks of save/restore latency or bandwidth for T2V state under mixed workloads. We will add these measurements in the revised version (new subsection in Evaluation or Appendix), reporting the time to checkpoint and restore noise tensors plus activations for representative T2V sequence lengths, both in isolation and when co-located with T2I requests. We will then show that the overhead remains a small fraction of step execution time and does not erode the reported SLO gains, or adjust the claims and scheduler if the data indicate otherwise. revision: yes
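The promised microbenchmark could start as small as this. The latent size, the 2-second nominal per-step time, and the host-side byte copy standing in for a GPU checkpoint are all assumptions made for the sketch, not measurements from the paper.

```python
import time

def time_copy(n_floats: int, trials: int = 5) -> float:
    """Best-of-N wall time to copy an fp32 buffer of n_floats elements,
    a crude stand-in for checkpointing a latent tensor."""
    buf = bytearray(n_floats * 4)
    best = float("inf")
    for _ in range(trials):
        t0 = time.perf_counter()
        _ = bytes(buf)                    # checkpoint = one full copy
        best = min(best, time.perf_counter() - t0)
    return best

save_s = time_copy(16 * 1024 * 1024)      # ~64 MB of latents (assumed size)
step_s = 2.0                              # nominal per-step compute time (assumed)
overhead_pct = 100 * save_s / step_s
# The rebuttal's claim holds only if overhead_pct stays a small fraction
# of per-step time under realistic T2V state sizes and PCIe contention.
```

A real version would time device-to-host and host-to-device transfers of the actual T2V state with co-located T2I traffic on the same link, which is where contention could erode the gains.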

Circularity Check

0 steps flagged

No circularity; empirical systems paper with measured results

Full rationale

The paper presents GENSERVE as a systems artifact for co-serving T2I and T2V diffusion workloads. Its core claim (up to 44% SLO improvement) rests on experimental measurements of three mechanisms (video preemption, elastic sequence parallelism with dynamic batching, and SLO-aware scheduler) rather than any derivation, equation, or fitted parameter. The abstract and description state the preemptibility insight as a design premise but do not reduce any prediction or result to that premise by construction. No self-citations, uniqueness theorems, or ansatzes appear as load-bearing steps. The work is self-contained against external benchmarks via reported outcomes; no step matches the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that diffusion steps are discrete and preemptible; no free parameters or invented entities are stated in the abstract.

axioms (1)
  • domain assumption: Diffusion inference proceeds in discrete, predictable steps and is naturally preemptible at step boundaries.
    Stated as the central insight enabling the three mechanisms.

pith-pipeline@v0.9.0 · 5525 in / 1224 out tokens · 61298 ms · 2026-05-10T20:17:43.936461+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

46 extracted references · 24 canonical work pages · 14 internal anchors
