Optimus: Elastic Decoding for Efficient Diffusion LLM Serving
Pith reviewed 2026-06-30 00:13 UTC · model grok-4.3
The pith
Optimus makes diffusion LLM decoding block size a runtime variable that a closed-loop scheduler adjusts to current load, sustaining high throughput without model retraining.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Optimus enables elastic decoding for diffusion LLMs by combining chunked decoding, which permits fine-grained execution steps without retraining, with saturation-aware scheduling that selects chunk sizes from runtime observations. This treats decoding granularity as a controllable variable that trades off GPU utilization against token efficiency, yielding up to 6.1x higher throughput than autoregressive decoding and 4.3x higher than fixed-block diffusion while preserving output accuracy.
What carries the argument
Chunked decoding with saturation-aware closed-loop scheduling that selects decoding granularity at runtime to balance utilization and redundant computation.
If this is right
- Throughput improves by up to 6.1 times relative to autoregressive decoding.
- Throughput improves by up to 4.3 times relative to fixed-block diffusion decoding.
- Performance stays stable as offered load moves from low to high utilization.
- End-to-end serving capacity rises under latency SLOs because idle time and redundant work both decrease.
Where Pith is reading between the lines
- The same chunk-and-schedule pattern could extend to other iterative non-autoregressive generators that currently fix their step size.
- Variable-granularity execution might reduce power draw in multi-tenant clusters by keeping GPUs closer to full utilization without extra tokens.
- Only the scheduler and kernel changes are needed, so the technique could be added to existing diffusion-LLM inference stacks without retraining.
Load-bearing premise
Runtime load can be measured accurately enough for the closed-loop scheduler to pick chunk sizes that improve utilization without adding overhead, and chunked decoding leaves the diffusion model's generated outputs unchanged.
What would settle it
Running the same diffusion model under rapidly varying load with fixed blocks versus Optimus and checking whether token-level accuracy or perplexity diverges when chunk sizes change.
Figures
read the original abstract
Large language model (LLM) serving is fundamentally limited by inefficient hardware utilization. Autoregressive (AR) decoding underutilizes GPUs due to its strictly sequential execution, while diffusion LLMs (DLLMs) improve throughput by decoding multiple tokens per iteration. However, fixed block-size diffusion decoding exhibits strong load sensitivity: large blocks exploit idle GPU resources under low load, but saturate early and incur substantial redundant computation under high load. As a result, throughput gains vanish beyond saturation, and no single decoding granularity performs well across dynamic serving workloads. We present Optimus, a serving system that enables elastic decoding for diffusion LLMs by dynamically adapting decoding granularity to runtime load. The key idea is to treat decoding granularity as a runtime control variable, balancing GPU utilization and token efficiency. Optimus combines chunked decoding, which enables fine-grained execution without retraining, with saturation-aware scheduling, a closed-loop mechanism that selects chunk sizes based on runtime conditions. Together with system-level optimizations and customized attention kernels, Optimus achieves significant performance improvements while preserving model accuracy. Experiments show that Optimus delivers up to 6.1x throughput improvement over AR decoding and 4.3x improvement over fixed-block diffusion LLM, while maintaining stable performance across diverse load regimes and improving end-to-end serving capacity under latency constraints. The source code is available at https://github.com/dubcyfor3/Optimus.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents Optimus, a serving system for diffusion LLMs that enables elastic decoding by treating decoding granularity as a runtime control variable. It combines chunked decoding (to allow fine-grained execution without retraining) with a saturation-aware closed-loop scheduler that selects chunk sizes based on monitored runtime load, plus system optimizations and custom attention kernels. The central claim is that this yields up to 6.1× throughput over autoregressive decoding and 4.3× over fixed-block diffusion decoding while preserving accuracy and delivering stable performance across load regimes.
Significance. If the empirical results hold under rigorous evaluation, the work would address a practical limitation in diffusion-LLM serving—load sensitivity of fixed block sizes—by providing a dynamic mechanism that improves GPU utilization without model changes. The open-source release is a positive factor for reproducibility in systems research.
major comments (2)
- [Abstract] Abstract: the headline throughput claims (6.1× vs. AR, 4.3× vs. fixed-block) are stated without any accompanying experimental details, baselines, error bars, accuracy metrics, load definitions, or hardware configuration, rendering the central empirical claim impossible to evaluate from the supplied text.
- [Scheduling mechanism (inferred from abstract)] Saturation-aware scheduling description: the closed-loop mechanism is asserted to select chunk sizes from runtime load “without meaningful overhead” and while preserving diffusion behavior, yet no measurements of monitoring cost, decision latency, or oscillation risk under intra-iteration load fluctuations are supplied; these quantities are load-bearing for the claim that elastic decoding outperforms the fixed-block baseline that the paper itself shows saturates early.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address each major comment below and indicate planned revisions to the manuscript.
read point-by-point responses
-
Referee: [Abstract] Abstract: the headline throughput claims (6.1× vs. AR, 4.3× vs. fixed-block) are stated without any accompanying experimental details, baselines, error bars, accuracy metrics, load definitions, or hardware configuration, rendering the central empirical claim impossible to evaluate from the supplied text.
Authors: We agree that the abstract, constrained by length, omits the supporting experimental details. The full manuscript provides these in Sections 4 (Experimental Setup) and 5 (Evaluation), including baselines (AR decoding and fixed-block diffusion), hardware (NVIDIA A100 GPUs), load definitions (request rates), accuracy metrics, and results with variability reporting. To improve standalone readability of the abstract, we will revise it to briefly reference the evaluation conditions and hardware. revision: partial
-
Referee: [Scheduling mechanism (inferred from abstract)] Saturation-aware scheduling description: the closed-loop mechanism is asserted to select chunk sizes from runtime load “without meaningful overhead” and while preserving diffusion behavior, yet no measurements of monitoring cost, decision latency, or oscillation risk under intra-iteration load fluctuations are supplied; these quantities are load-bearing for the claim that elastic decoding outperforms the fixed-block baseline that the paper itself shows saturates early.
Authors: Section 3.2 describes the saturation-aware closed-loop scheduler and its design for low overhead. We acknowledge that explicit measurements of monitoring cost, decision latency, and oscillation risk under fluctuating loads are not quantified in the current version. We will add these measurements (including overhead breakdowns and stability analysis) to the revised evaluation section to directly support the scheduler's claims relative to the fixed-block baseline. revision: yes
Circularity Check
No significant circularity; empirical systems contribution with no load-bearing derivations or self-citation chains
full rationale
The paper describes a serving system (Optimus) using chunked decoding and saturation-aware scheduling for diffusion LLMs. Throughput claims (6.1x vs AR, 4.3x vs fixed-block) are presented as experimental measurements under various loads, not as outputs of any mathematical derivation, fitted parameters, or uniqueness theorems. No equations, ansatzes, or self-citations are invoked to force the central results. The approach is self-contained as an engineering implementation with runtime monitoring and kernel optimizations; the reader's assessment of score 1.0 aligns with the absence of any reduction of claims to inputs by construction.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. Gpt-4 technical report
2023
-
[2]
Gulavani, Alexey Tumanov, and Ramachandran Ramjee
Arney Agrawal, Nitin Kedia, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav S. Gulavani, Alexey Tumanov, and Ramachandran Ramjee. 2025. Efficient LLM Inference via Chunked Prefills.SIGOPS Oper. Syst. Rev.59, 1 (Aug. 2025), 9–16. doi:10.1145/3759441.3759444
-
[3]
anon8231489123. 2023. ShareGPT Vicuna Unfiltered Dataset. https://huggingface.co/datasets/anon8231489123/ShareGPT_ Vicuna_unfiltered
2023
-
[4]
Chiu, Zhihan Yang, Zhix- uan Qi, Jiaqi Han, Subham Sekhar Sahoo, and Volodymyr Kuleshov
Marianne Arriola, Aaron Gokaslan, Justin T. Chiu, Zhihan Yang, Zhix- uan Qi, Jiaqi Han, Subham Sekhar Sahoo, and Volodymyr Kuleshov
-
[5]
Block Diffusion: Interpolating Between Autoregressive and Diffusion Language Models
Block Diffusion: Interpolating Between Autoregressive and Dif- fusion Language Models. arXiv:2503.09573 [cs.LG]https://arxiv.org/ abs/2503.09573
work page internal anchor Pith review Pith/arXiv arXiv
-
[6]
Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. 2021. Program synthesis with large language models
2021
-
[7]
Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, et al . 2024. Longbench: A bilingual, multitask benchmark for long context under- standing. InProceedings of the 62nd annual meeting of the association for computational linguistics (volume 1: Long papers). 3119–3137
2024
-
[8]
Tiwei Bie, Maosong Cao, Kun Chen, Lun Du, Mingliang Gong, Zhuochen Gong, Yanmei Gu, Jiaqi Hu, Zenan Huang, Zhenzhong Lan, Chengxi Li, Chongxuan Li, Jianguo Li, Zehuan Li, Huabin Liu, Lin Liu, Guoshan Lu, Xiaocheng Lu, Yuxin Ma, Jianfeng Tan, Lanning Wei, Ji-Rong Wen, Yipeng Xing, Xiaolu Zhang, Junbo Zhao, Da Zheng, Jun Zhou, Junlin Zhou, Zhanchao Zhou, Li...
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[9]
2018.The psychology of human-computer interaction
Stuart K Card et al. 2018.The psychology of human-computer interaction. Crc Press, USA
2018
-
[10]
Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. 2021. Evaluating large language models trained on code
2021
-
[11]
Xinhua Chen, Sitao Huang, Cong Guo, Chiyue Wei, Yintao He, Jianyi Zhang, Hai Li, Yiran Chen, et al. 2025. Dpad: Efficient diffusion lan- guage models with suffix dropout
2025
-
[12]
Shuang Cheng, Yihan Bian, Dawei Liu, Linfeng Zhang, Qian Yao, Zhongbo Tian, Wenhai Wang, Qipeng Guo, Kai Chen, Biqing Qi, and Bowen Zhou. 2025. SDAR: A Synergistic Diffusion-AutoRegression Paradigm for Scalable Sequence Generation. arXiv:2510.06303 [cs.LG] https://arxiv.org/abs/2510.06303
-
[13]
Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Hee- woo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. 2021. Training verifiers to solve math word problems, 2021
2021
-
[14]
LMDeploy Contributors. 2023. LMDeploy: A Toolkit for Compressing, Deploying, and Serving LLM.https://github.com/InternLM/lmdeploy
2023
-
[15]
Kamaluddeen Usman Danyaro, Maged Nasser, Abubakar Zakari, Shamsu Abdullahi, Atika Khanzada, Muhammad Muntasir Yakubu, Sara Shoaib, et al. 2025. LLM-Based Code Generation: A Systematic Literature Review With Technical and Demographic Insights.IEEE Access13 (2025), 194915–194939. 13
2025
-
[16]
Tim Dettmers, Mike Lewis, Younes Belkada, and Luke Zettlemoyer
-
[17]
int8 (): 8-bit matrix multiplication for transformers at scale.Advances in neural information processing systems35 (2022), 30318–30332
Gpt3. int8 (): 8-bit matrix multiplication for transformers at scale.Advances in neural information processing systems35 (2022), 30318–30332
2022
-
[18]
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova
-
[19]
InProceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers)
Bert: Pre-training of deep bidirectional transformers for lan- guage understanding. InProceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers). Association for Computational Linguistics, Minneapolis, Minnesota, 4171–4186
2019
-
[20]
Shengyue Guan, Jindong Wang, Jiang Bian, Bin Zhu, Jian-Guang Lou, and Haoyi Xiong. 2026. Evaluating LLM-based Agents for Multi-Turn Conversations: A Survey.ACM Trans. Intell. Syst. Technol.(Feb. 2026). doi:10.1145/3793671Just Accepted
-
[21]
Connor Holmes, Masahiro Tanaka, Michael Wyatt, Ammar Ah- mad Awan, Jeff Rasley, Samyam Rajbhandari, Reza Yazdani Am- inabadi, Heyang Qin, Arash Bakhtiari, Lev Kurilenko, et al . 2024. Deepspeed-fastgen: High-throughput text generation for llms via mii and deepspeed-inference
2024
- [22]
-
[23]
Zhanqiu Hu, Jian Meng, Yash Akhauri, Mohamed S Abdelfattah, Jae- sun Seo, Zhiru Zhang, and Udit Gupta. 2025. Accelerating diffusion language model inference via efficient kv caching and guided diffusion. arXiv–2505 pages
2025
-
[24]
Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bam- ford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gi- anna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. 2023. Mistral 7B. arXiv:2310.0...
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[25]
Gonzalez, Hao Zhang, and Ion Sto- ica
Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Sto- ica. 2023. Efficient Memory Management for Large Language Model Serving with PagedAttention. InProceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles. Association for Comput- ing Machinery, New York, NY, USA, 611–626
2023
-
[26]
Yaniv Leviathan, Matan Kalman, and Yossi Matias. 2023. Fast inference from transformers via speculative decoding. InProceedings of the 40th International Conference on Machine Learning(Honolulu, Hawaii, USA) (ICML’23). JMLR.org, USA, Article 795, 13 pages
2023
-
[27]
Pengxiang Li, Yefan Zhou, Dilxat Muhtar, Lu Yin, Shilin Yan, Li Shen, Yi Liang, Soroush Vosoughi, and Shiwei Liu. 2025. Diffusion Language Models Know the Answer Before Decoding
2025
-
[28]
Xiang Li, John Thickstun, Ishaan Gulrajani, Percy S Liang, and Tat- sunori B Hashimoto. 2022. Diffusion-lm improves controllable text generation.Advances in neural information processing systems35 (2022), 4328–4343
2022
-
[29]
Yuhui Li, Fangyun Wei, Chao Zhang, and Hongyang Zhang. 2024. Eagle: Speculative sampling requires rethinking feature uncertainty
2024
-
[30]
Yuhui Li, Fangyun Wei, Chao Zhang, and Hongyang Zhang. 2025. Eagle-3: Scaling up inference acceleration of large language models via training-time test
2025
-
[31]
Zikun Li, Zhuofu Chen, Remi Delacourt, Gabriele Oliaro, Zeyu Wang, Qinghan Chen, Shuhuai Lin, April Yang, Zhihao Zhang, Zhuoming Chen, Sean Lai, Xinhao Cheng, Xupeng Miao, and Zhihao Jia. 2025. AdaServe: Accelerating Multi-SLO LLM Serving with SLO-Customized Speculative Decoding. arXiv:2501.12162 [cs.CL]https://arxiv.org/abs/ 2501.12162
-
[32]
Aiwei Liu, Minghua He, Shaoxun Zeng, Sijun Zhang, Linhao Zhang, Chuhan Wu, Wei Jia, Yuan Liu, Xiao Zhou, and Jie Zhou. 2025. Wedlm: Reconciling diffusion language models with standard causal attention for fast inference
2025
-
[33]
Xiaoxuan Liu, Jongseok Park, Langxiang Hu, Woosuk Kwon, Zhuohan Li, Chen Zhang, Kuntai Du, Xiangxi Mo, Kaichao You, Alvin Cheung, Zhijie Deng, Ion Stoica, and Hao Zhang. 2025. TurboSpec: Closed-loop Speculation Control System for Optimizing LLM Serving Goodput. arXiv:2406.14066 [cs.AI]https://arxiv.org/abs/2406.14066
-
[34]
Zhiyuan Liu, Yicun Yang, Yaojie Zhang, Junjie Chen, Chang Zou, Qingyuan Wei, Shaobo Wang, and Linfeng Zhang. 2025. dllm-cache: Accelerating diffusion large language models with adaptive caching
2025
-
[35]
Xinyin Ma, Runpeng Yu, Gongfan Fang, and Xinchao Wang. 2025. dkv-cache: The cache for diffusion language models
2025
-
[36]
Mehran Nasseri, Patrick Brandtner, Robert Zimmermann, Taha Fala- touri, Farzaneh Darbanian, and Tobechi Obinwanne. 2023. Applica- tions of large language models (llms) in business analytics–exemplary use cases in data preparation tasks. InInternational conference on human-computer interaction. Springer, Springer-Verlag, Berlin, Heidel- berg, 182–198
2023
-
[37]
Shen Nie, Fengqi Zhu, Zebin You, Xiaolu Zhang, Jingyang Ou, Jun Hu, Jun Zhou, Yankai Lin, Ji-Rong Wen, and Chongxuan Li. 2025. Large Language Diffusion Models. arXiv:2502.09992 [cs.CL]https: //arxiv.org/abs/2502.09992
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[38]
NVIDIA. 2026. Matrix Multiplication Background User’s Guide. https://docs.nvidia.com/deeplearning/performance/dl-performance- matrix-multiplication/index.html
2026
-
[39]
Noam Shazeer. 2019. Fast Transformer Decoding: One Write-Head is All You Need. arXiv:1911.02150 [cs.NE]https://arxiv.org/abs/1911. 02150
work page internal anchor Pith review Pith/arXiv arXiv 2019
-
[40]
Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. 2019. Megatron-lm: Training multi-billion parameter language models using model parallelism
2019
-
[41]
Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. 2023. Gemini: a family of highly capable multi- modal models
2023
- [42]
-
[43]
Gomez, Łukasz Kaiser, and Illia Polosukhin
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. InProceedings of the 31st International Con- ference on Neural Information Processing Systems(Long Beach, Cali- fornia, USA)(NIPS’17). Curran Associates Inc., Red Hook, NY, USA, 6000–6010
2017
-
[44]
Xu Wang, Chenkai Xu, Yijie Jin, Jiachun Jin, Hao Zhang, and Zhijie Deng. 2025. Diffusion llms can do faster-than-ar inference via discrete diffusion forcing
2025
- [45]
-
[46]
Chengyue Wu, Hao Zhang, Shuchen Xue, Shizhe Diao, Yonggan Fu, Zhijian Liu, Pavlo Molchanov, Ping Luo, Song Han, and Enze Xie. 2025. Fast-dllm v2: Efficient block-diffusion llm
2025
-
[47]
Chengyue Wu, Hao Zhang, Shuchen Xue, Zhijian Liu, Shizhe Diao, Ligeng Zhu, Ping Luo, Song Han, and Enze Xie. 2025. Fast-dllm: Training-free acceleration of diffusion llm by enabling kv cache and parallel decoding
2025
-
[48]
An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, et al. 2025. Qwen3 Technical Report. arXiv:2505.09388 [cs.CL] https://arxiv.org/abs/2505.09388
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[49]
Gyeong-In Yu, Joo Seong Jeong, Geon-Woo Kim, Soojeong Kim, and Byung-Gon Chun. 2022. Orca: A distributed serving system for 14 {Transformer-Based} generative models. In16th USENIX symposium on operating systems design and implementation (OSDI 22). USENIX Association, USA, 521–538
2022
-
[50]
Ailing Zeng, Muxi Chen, Lei Zhang, and Qiang Xu. 2023. Are trans- formers effective for time series forecasting?. InProceedings of the Thirty-Seventh AAAI Conference on Artificial Intelligence and Thirty- Fifth Conference on Innovative Applications of Artificial Intelligence and Thirteenth Symposium on Educational Advances in Artificial Intelli- gence (AA...
-
[51]
Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Tianle Li, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zhuohan Li, Zi Lin, Eric P Xing, et al. 2023. Lmsys-chat-1m: A large-scale real-world llm conversation dataset
2023
-
[52]
Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, Cody H Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E Gonzalez, et al. 2024. Sglang: Efficient execution of structured lan- guage model programs.Advances in neural information processing systems37 (2024), 62557–62583
2024
-
[53]
Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xu- anzhe Liu, Xin Jin, and Hao Zhang. 2024. {DistServe}: Disaggregating prefill and decoding for goodput-optimized large language model serving. In18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24). USENIX Association, USA, 193–210
2024
-
[54]
Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xu- anzhe Liu, Xin Jin, and Hao Zhang. 2024. DistServe: disaggregating prefill and decoding for goodput-optimized large language model serv- ing. InProceedings of the 18th USENIX Conference on Operating Systems Design and Implementation(Santa Clara, CA, USA)(OSDI’24). USENIX Association, USA, Art...
2024
-
[55]
Jeffrey Zhou, Tianjian Lu, Swaroop Mishra, Siddhartha Brahma, Sujoy Basu, Yi Luan, Denny Zhou, and Le Hou. 2023. Instruction-following evaluation for large language models
2023
-
[56]
Fengqi Zhu, Rongzhen Wang, Shen Nie, Xiaolu Zhang, Chunwei Wu, Jun Hu, Jun Zhou, Jianfei Chen, Yankai Lin, Ji-Rong Wen, and Chongx- uan Li. 2025. LLaDA 1.5: Variance-Reduced Preference Optimiza- tion for Large Language Diffusion Models. arXiv:2505.19223 [cs.LG] https://arxiv.org/abs/2505.19223
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[57]
Kan Zhu, Yufei Gao, Yilong Zhao, Liangyu Zhao, Gefei Zuo, Yile Gu, Dedong Xie, Zihao Ye, Keisuke Kamahori, Chien-Yu Lin, et al. 2025. {NanoFlow}: Towards optimal large language model serving through- put. In19th USENIX Symposium on Operating Systems Design and Implementation (OSDI 25). USENIX Association, USA, 749–765. 15
2025
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.