DiLaServe: High SLO Attainment Serving for Diffusion Language Models
Pith reviewed 2026-06-30 09:17 UTC · model grok-4.3
The pith
DiLaServe achieves up to 56.6 percentage points higher SLO attainment for diffusion language models by deadline-aware scheduling and quality-aware cluster reconfiguration.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DiLaServe is a cluster-level serving system for DLMs that enables deadline-aware scheduling and adaptive load control through confidence-threshold adjustment, and dynamically reconfigures the cluster by solving a quality-aware optimization problem while explicitly modeling the step-level heterogeneity introduced by approximate KV caching. Across multiple benchmarks and real-world traces, this yields up to 56.6 percentage points better SLO attainment and up to 46% lower end-to-end request latency while keeping accuracy drop below 1%.
What carries the argument
The quality-aware optimization problem that dynamically reconfigures the cluster while modeling step-level heterogeneity from approximate KV caching.
If this is right
- Deadline-aware scheduling paired with confidence-threshold adjustment allows DLMs to meet latency targets while preserving output quality.
- Accounting for non-uniform per-step costs from approximate KV caching improves decisions about parallelization levels under changing load.
- The resulting system delivers measurable reductions in end-to-end latency alongside higher SLO attainment across evaluated traces.
- Accuracy remains within 1% of baseline on the tested benchmarks when the optimization is applied.
Where Pith is reading between the lines
- Similar deadline and cost-modeling techniques could transfer to other parallel token-generation architectures that exhibit variable per-step work.
- If KV-cache approximations become more common in production, explicit heterogeneity modeling may become a standard component of serving schedulers.
- Extending the optimization to include energy or memory constraints would be a direct next step given the current formulation.
Load-bearing premise
The modeling of step-level heterogeneity from approximate KV caching and the quality-aware optimization problem accurately reflect production dynamics, and the chosen benchmarks plus traces are representative of real serving workloads.
What would settle it
Running DiLaServe on a fresh real-world trace or benchmark workload and measuring whether the reported gains in SLO attainment and latency reduction still appear.
Figures
read the original abstract
Diffusion language models (DLMs) have recently emerged as a promising alternative to conventional autoregressive language models. By generating multiple tokens in parallel during each denoising step, they offer higher inference throughput while maintaining competitive quality. However, realizing these throughput gains while meeting latency SLOs in a serving system requires addressing challenges introduced by DLMs' unique characteristics. These include navigating the speed-quality tradeoff created by confidence-based denoising, choosing appropriate parallelization levels across model instances under fluctuating load, and coordinating approximate KV caching mechanisms that introduce non-uniform per-step costs. To address these challenges, we present DiLaServe, a cluster-level serving system for DLMs. DiLaServe enables deadline-aware scheduling and adaptive load control through confidence-threshold adjustment, and dynamically reconfigures the cluster by solving a quality-aware optimization problem, while explicitly modeling the step-level heterogeneity introduced by approximate KV caching. Across multiple benchmarks and real-world traces, DiLaServe improves SLO attainment by up to 56.6 percentage points and reduces end-to-end request latency by up to 46\% while incurring less than 1\% accuracy drop.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents DiLaServe, a cluster-level serving system for diffusion language models (DLMs). DLMs generate multiple tokens in parallel per denoising step but introduce speed-quality tradeoffs via confidence-based denoising, variable parallelization needs under load, and non-uniform per-step costs from approximate KV caching. DiLaServe addresses these via deadline-aware scheduling, adaptive load control through confidence-threshold adjustment, and dynamic cluster reconfiguration by solving a quality-aware optimization problem that explicitly models the step-level heterogeneity. Across benchmarks and real-world traces, it reports up to 56.6 percentage points higher SLO attainment, up to 46% lower end-to-end latency, and <1% accuracy drop.
Significance. If the empirical claims hold under the modeled costs, the work is significant for practical deployment of DLMs, which promise higher throughput than autoregressive models but face unique serving challenges. The explicit modeling of KV-cache-induced heterogeneity and the quality-aware optimizer are potential strengths if they are shown to generalize beyond the evaluated traces.
major comments (1)
- [experimental evaluation / modeling of KV caching] The central empirical claims (56.6 pp SLO gain, 46% latency reduction) are produced by solving the quality-aware optimization under the modeled step-level costs from approximate KV caching. The manuscript must include direct validation that these modeled per-step costs match measured execution times on real hardware under fluctuating load (e.g., in the experimental methodology or § on system implementation); without this, the optimizer may select configurations that fail to deliver the reported gains.
minor comments (1)
- [abstract] The abstract states quantitative gains but provides no details on experimental methodology, error bars, data exclusion rules, or statistical significance; this should be added to the abstract or a dedicated methods subsection.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address the single major comment below and will incorporate the requested validation in the revision.
read point-by-point responses
-
Referee: [experimental evaluation / modeling of KV caching] The central empirical claims (56.6 pp SLO gain, 46% latency reduction) are produced by solving the quality-aware optimization under the modeled step-level costs from approximate KV caching. The manuscript must include direct validation that these modeled per-step costs match measured execution times on real hardware under fluctuating load (e.g., in the experimental methodology or § on system implementation); without this, the optimizer may select configurations that fail to deliver the reported gains.
Authors: We agree that direct validation of the modeled per-step costs (derived from approximate KV caching) against measured execution times on real hardware under fluctuating load is necessary to confirm the optimizer produces realizable gains. The current manuscript models these costs from profiling in the system implementation section but does not include an explicit side-by-side comparison under dynamic load conditions. We will add this validation to the experimental methodology (new subsection) by reporting measured vs. modeled per-step latencies on the same hardware and traces used for the end-to-end results, thereby strengthening the link between the quality-aware optimizer and the reported SLO/latency improvements. revision: yes
Circularity Check
No circularity; empirical claims with no load-bearing derivations or self-referential fits
full rationale
The paper is a systems/empirical work whose central claims (SLO gains, latency reductions) are produced by experimental evaluation on benchmarks and traces. No equations, optimization formulations, or modeling steps are visible in the abstract or reader's summary that reduce by construction to fitted inputs or self-citations. The quality-aware optimization and step-level heterogeneity model are presented as design choices whose accuracy is evaluated externally rather than assumed by definition. This matches the reader's assessment of no visible derivations and warrants the default non-finding of score 0.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Deepak Agarwal, Bo Long, Jonathan Traupman, Doris Xin, and Liang Zhang. 2014. LASER: a scalable response prediction platform for online advertising. In Proceedings of the 7th ACM International Conference on Web Search and Data Mining (New York, New York, USA) (WSDM ’14). Association for Computing Machinery, New York, NY, USA, 173–182. doi:10.1145/2556195.2556252
-
[2]
Shubham Agarwal, Subrata Mitra, Sarthak Chakraborty, Srikrishna Karanam, Koyel Mukherjee, and Shiv Kumar Saini. 2024. Approxi- mate caching for efficiently serving text-to-image diffusion models. In Proceedings of the 21st USENIX Symposium on Networked Systems Design and Implementation (Santa Clara, CA, USA) (NSDI’24). USENIX Association, USA, Article 65,...
2024
-
[3]
Friedman, Thomas Williams, Ramesh K
Sohaib Ahmad, Hui Guan, Brian D. Friedman, Thomas Williams, Ramesh K. Sitaraman, and Thomas Woo. 2024. Proteus: A High-Throughput Inference-Serving System with Accuracy Scal- ing. In Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1 (La Jolla, CA, USA) (ASPLOS ’24). Asso...
-
[4]
Sitaraman, and Hui Guan
Sohaib Ahmad, Qizheng Yang, Haoliang Wang, Ramesh K. Sitaraman, and Hui Guan. 2025. DiffServe: Efficiently Serving Text-to-Image Diffu- sion Models with Query-Aware Model Scaling. In Eighth Conference on Machine Learning and Systems.https://openreview.net/forum? id=1N3ShLfcTf
2025
-
[5]
Anthropic. 2025. Claude Code.https://claude.com/product/claude- code. Accessed: 2025-12-10
2025
-
[6]
Johnson, Jonathan Ho, Daniel Tarlow, and Rianne van den Berg
Jacob Austin, Daniel D. Johnson, Jonathan Ho, Daniel Tarlow, and Rianne van den Berg. 2021. Structured denoising diffusion mod- els in discrete state-spaces. In Proceedings of the 35th International Conference on Neural Information Processing Systems (NIPS ’21). Curran Associates Inc., Red Hook, NY, USA, Article 1376, 13 pages
2021
-
[7]
A is B” fail to learn “B is A
Lukas Berglund, Meg Tong, Maximilian Kaufmann, Mikita Balesni, Asa Cooper Stickland, Tomasz Korbak, and Owain Evans. 2024. The Reversal Curse: LLMs trained on “A is B” fail to learn “B is A”. In The Twelfth International Conference on Learning Representations. https://openreview.net/forum?id=GPKTIktA0k
2024
-
[8]
Tzu-Tao Chang and Shivaram Venkataraman. 2025. Eva: Cost-Efficient Cloud-Based Cluster Scheduling. In Proceedings of the Twentieth European Conference on Computer Systems (Rotterdam, Nether- lands) (EuroSys ’25). Association for Computing Machinery, New York, NY, USA, 1399–1416. doi:10.1145/3689031.3717483
-
[9]
Shuang Cheng, Yihan Bian, Dawei Liu, Linfeng Zhang, Qian Yao, Zhongbo Tian, Wenhai Wang, Qipeng Guo, Kai Chen, Biqing Qi, and Bowen Zhou. 2025. SDAR: A Synergistic Diffusion-AutoRegression Paradigm for Scalable Sequence Generation. arXiv:2510.06303 [cs.LG] https://arxiv.org/abs/2510.06303 13
-
[10]
Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Hee- woo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. 2021. Train- ing Verifiers to Solve Math Word Problems. arXiv:2110.14168 [cs.LG] https://arxiv.org/abs/2110.14168
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[11]
Franklin, Joseph E
Daniel Crankshaw, Xin Wang, Guilio Zhou, Michael J. Franklin, Joseph E. Gonzalez, and Ion Stoica. 2017. Clipper: A Low-Latency Online Prediction Serving System. In 14th USENIX Symposium on Networked Systems Design and Implementation (NSDI 17). USENIX Association, Boston, MA, 613–627.https://www.usenix.org/ conference/nsdi17/technical-sessions/presentation...
2017
-
[12]
Yinwei Dai, Rui Pan, Anand Iyer, Kai Li, and Ravi Netravali. 2024. Ap- parate: Rethinking Early Exits to Tame Latency-Throughput Tensions in ML Serving. In Proceedings of the ACM SIGOPS 30th Symposium on Operating Systems Principles (Austin, TX, USA) (SOSP ’24). As- sociation for Computing Machinery, New York, NY, USA, 607–623. doi:10.1145/3694715.3695963
-
[13]
Fu, Stefano Ermon, Atri Rudra, and Christopher Ré
Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. 2022. FLASHATTENTION: fast and memory-efficient exact at- tention with IO-awareness. In Proceedings of the 36th International Conference on Neural Information Processing Systems (New Orleans, LA, USA) (NIPS ’22). Curran Associates Inc., Red Hook, NY, USA, Ar- ticle 1189, 16 pages
2022
-
[14]
Jerome H Friedman. 2001. Greedy function approximation: a gradient boosting machine. Annals of statistics (2001), 1189–1232
2001
-
[15]
Aditya Ganjam, Faisal Siddiqui, Jibin Zhan, Xi Liu, Ion Stoica, Junchen Jiang, Vyas Sekar, and Hui Zhang. 2015. C3: Internet-Scale Control Plane for Video Quality Optimization. In 12th USENIX Symposium on Networked Systems Design and Implementation (NSDI 15). USENIX Association, Oakland, CA, 131–144.https://www.usenix. org/conference/nsdi15/technical-sess...
2015
-
[16]
Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yixin Dai, Jiawei Sun, Haofen Wang, and Haofen Wang. 2023. Retrieval-augmented generation for large language models: A survey. arXiv preprint arXiv:2312.10997 2, 1 (2023)
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[17]
Google DeepMind. 2025. Gemini Diffusion.https://deepmind.google/ models/gemini-diffusion/. [text diffusion model]
2025
-
[18]
Arpan Gujarati, Reza Karimi, Safya Alzayat, Wei Hao, Antoine Kaufmann, Ymir Vigfusson, and Jonathan Mace. 2020. Serving DNNs like Clockwork: Performance Predictability from the Bottom Up. In 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20). USENIX Association, 443–462.https: //www.usenix.org/conference/osdi20/presentation/gujarati
2020
-
[19]
Peizhen Guo, Bo Hu, and Wenjun Hu. 2022. Sommelier: Cu- rating DNN Models for the Masses. In Proceedings of the 2022 International Conference on Management of Data (Philadelphia, PA, USA) (SIGMOD ’22). Association for Computing Machinery, New York, NY, USA, 1876–1890. doi:10.1145/3514221.3526173
-
[20]
Gurobi Optimization, LLC. 2025. Gurobi Optimizer Reference Manual. https://www.gurobi.com
2025
-
[21]
Zhengfu He, Tianxiang Sun, Qiong Tang, Kuanning Wang, Xuanjing Huang, and Xipeng Qiu. 2023. DiffusionBERT: Improving Genera- tive Masked Language Models with Diffusion Models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Anna Rogers, Jordan Boyd- Graber, and Naoaki Okazaki (Eds.). Asso...
-
[22]
Ke Hong, Xiuhong Li, Lufang Chen, Qiuli Mao, Guohao Dai, Xuefei Ning, Shengen Yan, Yun Liang, and Yu Wang. 2025. SOLA: Optimizing SLO Attainment for Large Language Model Serving with State-Aware Scheduling. In Eighth Conference on Machine Learning and Systems. https://openreview.net/forum?id=ubIvpetAd6
2025
-
[23]
Junhao Hu, Jiang Xu, Zhixia Liu, Yulong He, Yuetao Chen, Hao Xu, Jiang Liu, Jie Meng, Baoquan Zhang, Shining Wan, Gengyuan Dan, Zhiyu Dong, Zhihao Ren, Changhong Liu, Tao Xie, Dayun Lin, Qin Zhang, Yue Yu, Hao Feng, Xusheng Chen, and Yizhou Shan
-
[24]
In Proceedings of the 2025 USENIX Conference on Usenix Annual Technical Conference (Boston, MA, USA)(USENIX ATC ’25)
DEEPSERVE: serverless large language model serving at scale. In Proceedings of the 2025 USENIX Conference on Usenix Annual Technical Conference (Boston, MA, USA)(USENIX ATC ’25). USENIX Association, USA, Article 4, 16 pages
2025
-
[25]
Leonard Kleinrock. 1975. Theory, Volume 1, Queueing Systems. Wiley-Interscience, USA
1975
-
[26]
Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica
-
[27]
Efficient Memory Management for Large Language Model Serving with PagedAttention. In Proceedings of the 29th Symposium on Operating Systems Principles (Koblenz, Germany) (SOSP ’23). As- sociation for Computing Machinery, New York, NY, USA, 611–626. doi:10.1145/3600006.3613165
-
[28]
Inception Labs, Samar Khanna, Siddhant Kharbanda, Shufan Li, Harshit Varma, Eric Wang, Sawyer Birnbaum, Ziyang Luo, Yanis Mi- raoui, Akash Palrecha, Stefano Ermon, Aditya Grover, and Volodymyr Kuleshov. 2025. Mercury: Ultra-Fast Language Models Based on Dif- fusion. arXiv:2506.17298 [cs.CL]https://arxiv.org/abs/2506.17298
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[29]
Suyi Li, Lingyun Yang, Xiaoxiao Jiang, Hanfeng Lu, Dakai An, Zhipeng Di, Weiyi Lu, Jiawei Chen, Kan Liu, Yinghao Yu, Tao Lan, Guodong Yang, Lin Qu, Liping Zhang, and Wei Wang. 2025. KATZ: effi- cient workflow serving for diffusion models with many adapters. In Proceedings of the 2025 USENIX Conference on Usenix Annual Technical Conference (Boston, MA, USA...
2025
-
[30]
Tianyi Li, Mingda Chen, Bowei Guo, and Zhiqiang Shen. 2025. A Survey on Diffusion Language Models. arXiv:2508.10875 [cs.CL]https: //arxiv.org/abs/2508.10875
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[31]
Gon- zalez, and Ion Stoica
Zhuohan Li, Lianmin Zheng, Yinmin Zhong, Vincent Liu, Ying Sheng, Xin Jin, Yanping Huang, Zhifeng Chen, Hao Zhang, Joseph E. Gon- zalez, and Ion Stoica. 2023. AlpaServe: Statistical Multiplexing with Model Parallelism for Deep Learning Serving. In 17th USENIX Symposium on Operating Systems Design and Implementation (OSDI 23). USENIX Association, Boston, M...
2023
-
[32]
Yen-Ting Lin and Yun-Nung Chen. 2023. LLM-Eval: Unified Multi- Dimensional Automatic Evaluation for Open-Domain Conversations with Large Language Models. In Proceedings of the 5th Workshop on NLP for Conversational AI (NLP4ConvAI 2023), Yun-Nung Chen and Abhinav Rastogi (Eds.). Association for Computational Linguistics, Toronto, Canada, 47–58. doi:10.1865...
-
[33]
Zhiyuan Liu, Yicun Yang, Yaojie Zhang, Junjie Chen, Chang Zou, Qingyuan Wei, Shaobo Wang, and Linfeng Zhang. 2025. dLLM-Cache: Accelerating Diffusion Large Language Models with Adaptive Caching. arXiv:2506.06295 [cs.LG]https://arxiv.org/abs/2506.06295
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[34]
Aaron Lou, Chenlin Meng, and Stefano Ermon. 2024. Discrete diffusion modeling by estimating the ratios of the data distri- bution. In Proceedings of the 41st International Conference on Machine Learning (Vienna, Austria) (ICML’24). JMLR.org, Article 1333, 30 pages
2024
-
[35]
Jayashree Mohan, Amar Phanishayee, Janardhan Kulkarni, and Vijay Chidambaram. 2022. Looking Beyond GPUs for DNN Scheduling on Multi-Tenant Clusters. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22). USENIX Association, Carlsbad, CA, 579–596.https://www.usenix.org/conference/osdi22/ presentation/mohan
2022
-
[36]
Ilyas, Theodoros Rekatsinas, and Shivaram Venkataraman
Jason Mohoney, Devesh Sarda, Mengze Tang, Shihabur Rahman Chowdhury, Anil Pacaci, Ihab F. Ilyas, Theodoros Rekatsinas, and Shivaram Venkataraman. 2025. Quake: adaptive indexing for vector search. In Proceedings of the 19th USENIX Conference on Operating 14 Systems Design and Implementation (Boston, MA, USA) (OSDI ’25). USENIX Association, USA, Article 9, 17 pages
2025
-
[37]
Jordan, and Ion Stoica
Philipp Moritz, Robert Nishihara, Stephanie Wang, Alexey Tu- manov, Richard Liaw, Eric Liang, Melih Elibol, Zongheng Yang, William Paul, Michael I. Jordan, and Ion Stoica. 2018. Ray: A Dis- tributed Framework for Emerging AI Applications. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18). USENIX Association, Carlsbad, CA, 5...
2018
-
[38]
Shen Nie, Fengqi Zhu, Zebin You, Xiaolu Zhang, Jingyang Ou, Jun Hu, Jun Zhou, Yankai Lin, Ji-Rong Wen, and Chongxuan Li. 2025. Large Language Diffusion Models. arXiv preprint arXiv:2502.09992 (2025)
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[39]
Augustus Odena, Charles Sutton, David Martin Dohan, Ellen Jiang, Henryk Michalewski, Jacob Austin, Maarten Paul Bosma, Maxwell Nye, Michael Terry, and Quoc V. Le. 2021. Program Synthesis with Large Language Models. In n/a. n/a, n/a. n/a
2021
-
[40]
OpenAI. 2024. GPT-4 Technical Report. arXiv (2024).https://arxiv. org/abs/2303.08774
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[41]
Anand Padmanabha Iyer, Mingyu Guan, Yinwei Dai, Rui Pan, Swapnil Gandhi, and Ravi Netravali. 2024. Improving DNN Inference Through- put Using Practical, Per-Input Compute Adaptation. InProceedings of the ACM SIGOPS 30th Symposium on Operating Systems Principles (Austin, TX, USA) (SOSP ’24). Association for Computing Machinery, New York, NY, USA, 624–639. ...
-
[42]
Pratyush Patel, Esha Choukse, Chaojie Zhang, Aashaka Shah, Íñigo Goiri, Saeed Maleki, and Ricardo Bianchini. 2024. Splitwise: Efficient Generative LLM Inference Using Phase Splitting. In 2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA). 118–132. doi:10.1109/ISCA59077.2024.00019
-
[43]
Yadwadkar, and Christos Kozyrakis
Francisco Romero, Qian Li, Neeraja J. Yadwadkar, and Christos Kozyrakis. 2021. INFaaS: Automated Model-less Inference Serving. In 2021 USENIX Annual Technical Conference (USENIX ATC 21). USENIX Association, 397–411.https://www.usenix.org/conference/ atc21/presentation/romero
2021
-
[44]
Emma Roth. 2025. OpenAI says ChatGPT users send over 2.5 billion prompts every day. The Verge (2025).https://www.theverge.com/ news/710867/openai-chatgpt-daily-prompts-2-billionAccessed: 2025- 12-10
2025
-
[45]
Subham Sekhar Sahoo, Marianne Arriola, Aaron Gokaslan, Edgar Mar- iano Marroquin, Alexander M Rush, Yair Schiff, Justin T Chiu, and Volodymyr Kuleshov. 2024. Simple and Effective Masked Diffusion Language Models. InThe Thirty-eighth Annual Conference on Neural Information Processing Systems.https://openreview.net/forum?id= L4uaAR4ArM
2024
-
[46]
Haichen Shen, Lequn Chen, Yuchen Jin, Liangyu Zhao, Bingyu Kong, Matthai Philipose, Arvind Krishnamurthy, and Ravi Sundaram. 2019. Nexus: a GPU cluster engine for accelerating DNN-based video analysis. In Proceedings of the 27th ACM Symposium on Operating Systems Principles (Huntsville, Ontario, Canada) (SOSP ’19). Asso- ciation for Computing Machinery, N...
-
[47]
Yuxuan Song, Zheng Zhang, Cheng Luo, Pengyang Gao, Fan Xia, Hao Luo, Zheng Li, Yuehang Yang, Hongli Yu, Xingwei Qu, Yuwei Fu, Jing Su, Ge Zhang, Wenhao Huang, Mingxuan Wang, Lin Yan, Xiaoying Jia, Jingjing Liu, Wei-Ying Ma, Ya-Qin Zhang, Yonghui Wu, and Hao Zhou. 2025. Seed Diffusion: A Large-Scale Diffusion Language Model with High-Speed Inference. arXiv...
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[48]
Jovan Stojkovic, Chaojie Zhang, Íñigo Goiri, Josep Torrellas, and Esha Choukse. 2025. DynamoLLM: Designing LLM Inference Clusters for Performance and Energy Efficiency. In 2025 IEEE International Symposium on High Performance Computer Architecture (HPCA). 1348–1362. doi:10.1109/HPCA61900.2025.00102
-
[49]
Foteini Strati, Sara Mcallister, Amar Phanishayee, Jakub Tar- nawski, and Ana Klimovic. 2024. DéjàVu: KV-cache Streaming for Fast, Fault-tolerant Generative LLM Serving. In Proceedings of the 41st International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 235), Ruslan Salakhutdinov, Zico Kolter, Katherine Heller, Adrian W...
2024
-
[50]
Biao Sun, Ziming Huang, Hanyu Zhao, Wencong Xiao, Xinyi Zhang, Yong Li, and Wei Lin. 2024. Llumnix: dynamic scheduling for large lan- guage model serving. In Proceedings of the 18th USENIX Conference on Operating Systems Design and Implementation (Santa Clara, CA, USA) (OSDI’24). USENIX Association, USA, Article 10, 19 pages
2024
-
[51]
ShareGPT Team. [n. d.]. ShareGPT.https://sharegpt.com/
-
[52]
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie- Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. 2023. LLaMA: Open and Efficient Foundation Language Models. arXiv:2302.13971 [cs.CL]https://arxiv. org/abs/2302.13971
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[53]
Yuxin Wang, Yuhan Chen, Zeyu Li, Xueze Kang, Yuchu Fang, Yeju Zhou, Yang Zheng, Zhenheng Tang, Xin He, Rui Guo, Xin Wang, Qiang Wang, Amelie Chi Zhou, and Xiaowen Chu. 2025. BurstGPT: A Real-World Workload Dataset to Optimize LLM Serving Systems (KDD ’25). Association for Computing Machinery, New York, NY, USA, 5831–5841. doi:10.1145/3711896.3737413
-
[54]
Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chan- dra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, Tianle Li, Max Ku, Kai Wang, Alex Zhuang, Rongqi Fan, Xiang Yue, and Wenhu Chen. 2024. MMLU-Pro: a more robust and challeng- ing multi-task language understanding benchmark. In Proceedings of the 38th International Conference...
2024
- [55]
-
[56]
Chengyue Wu, Hao Zhang, Shuchen Xue, Zhijian Liu, Shizhe Diao, Ligeng Zhu, Ping Luo, Song Han, and Enze Xie. 2025. Fast-dLLM: Training-free Acceleration of Diffusion LLM by Enabling KV Cache and Parallel Decoding. arXiv:2505.22618 [cs.CL]https://arxiv.org/abs/ 2505.22618
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[57]
Jiacheng Ye, Zhihui Xie, Lin Zheng, Jiahui Gao, Zirui Wu, Xin Jiang, Zhenguo Li, and Lingpeng Kong. 2025. Dream 7B: Diffusion Large Language Models. arXiv preprint arXiv:2508.15487 (2025)
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[58]
Hong Zhang, Yupeng Tang, Anurag Khandelwal, and Ion Stoica
-
[59]
In 20th USENIX Symposium on Networked Systems Design and Implementation (NSDI 23)
SHEPHERD: Serving DNNs in the Wild. In 20th USENIX Symposium on Networked Systems Design and Implementation (NSDI 23). USENIX Association, Boston, MA, 787–808.https://www. usenix.org/conference/nsdi23/presentation/zhang-hong
-
[60]
Xing, Hao Zhang, Joseph E
Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhang- hao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. 2023. Judging LLM-as-a-judge with MT-bench and Chatbot Arena. InProceedings of the 37th International Conference on Neural Information Processing Systems (New Orleans, LA, USA...
2023
-
[61]
Yuxiang Zheng, Dayuan Fu, Xiangkun Hu, Xiaojie Cai, Lyuman- shan Ye, Pengrui Lu, and Pengfei Liu. 2025. DeepResearcher: Scal- ing Deep Research via Reinforcement Learning in Real-world En- vironments. In Proceedings of the 2025 Conference on Empirical 15 Methods in Natural Language Processing, Christos Christodoulopou- los, Tanmoy Chakraborty, Carolyn Ros...
-
[62]
Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, and Hao Zhang. 2024. DistServe: disaggre- gating prefill and decoding for goodput-optimized large language model serving. In Proceedings of the 18th USENIX Conference on Operating Systems Design and Implementation (Santa Clara, CA, USA) (OSDI’24). USENIX Association, USA, Ar...
2024
-
[63]
Fengqi Zhu, Rongzhen Wang, Shen Nie, Xiaolu Zhang, Chunwei Wu, Jun Hu, Jun Zhou, Jianfei Chen, Yankai Lin, Ji-Rong Wen, and Chongx- uan Li. 2025. LLaDA 1.5: Variance-Reduced Preference Optimiza- tion for Large Language Diffusion Models. arXiv:2505.19223 [cs.LG] https://arxiv.org/abs/2505.19223
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[64]
Fengqi Zhu, Zebin You, Yipeng Xing, Zenan Huang, Lin Liu, Yihong Zhuang, Guoshan Lu, Kangyu Wang, Xudong Wang, Lanning Wei, Hongrui Guo, Jiaqi Hu, Wentao Ye, Tieyuan Chen, Chenchen Li, Chengfu Tang, Haibo Feng, Jun Hu, Jun Zhou, Xiaolu Zhang, Zhen- zhong Lan, Junbo Zhao, Da Zheng, Chongxuan Li, Jianguo Li, and Ji-Rong Wen. 2025. LLaDA-MoE: A Sparse MoE Di...
-
[65]
Kan Zhu, Yufei Gao, Yilong Zhao, Liangyu Zhao, Gefei Zuo, Yile Gu, Dedong Xie, Tian Tang, Qinyu Xu, Zihao Ye, Keisuke Kamahori, Chien- Yu Lin, Ziren Wang, Stephanie Wang, Arvind Krishnamurthy, and Baris Kasikci. 2025. NanoFlow: towards optimal large language model serving throughput. In Proceedings of the 19th USENIX Conference on Operating Systems Design...
2025
-
[66]
score": <0-10>,
Safety Rules: - Judge only the assistant answer. - Prefer factual accuracy over style. - Penalize unsafe or harmful advice heavily. - If the request does not provide enough information to fully verify facts, score based on likely usefulness and internal consistency. - Return only valid JSON matching the required schema. - The score must be an integer from...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.