Fast Distributed Inference Serving for Large Language Models
Pith reviewed 2026-05-17 11:17 UTC · model grok-4.3
The pith
FastServe enables token-level preemption and skip-join scheduling for LLM inference to raise throughput while holding latency fixed.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
FastServe is a distributed inference serving system for large language models built on three mechanisms: preemption at the granularity of each output token; a skip-join Multi-Level Feedback Queue scheduler that uses input length to assign each job an appropriate initial queue, skipping higher-priority queues to reduce demotions; and proactive offloading of intermediate state between GPU and host memory. Together, these mechanisms produce throughput gains over vLLM of up to 31.4x under the same average latency requirement and up to 17.9x under the same tail latency requirement.
What carries the argument
The skip-join Multi-Level Feedback Queue scheduler, which assigns each arriving job to an initial queue based on its input length and skips the higher-priority queues to reduce demotions, while supporting token-level preemption.
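The skip-join idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the paper's implementation: the class and method names (`SkipJoinMLFQ`, `admit`, `pick`, `requeue`), the doubling quanta, and the linear cost model `ms_per_input_token` are all hypothetical. The sketch places a job in the highest-priority queue whose quantum covers its estimated first-iteration time, which is one plausible reading of "assign an initial queue based on input length".

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Job:
    job_id: str
    input_len: int               # prompt length in tokens
    level: int = 0               # current queue index (0 = highest priority)
    used_in_level: float = 0.0   # service time consumed at the current level (ms)


class SkipJoinMLFQ:
    """Toy skip-join MLFQ. Quanta and cost model are illustrative assumptions."""

    def __init__(self, quanta, ms_per_input_token):
        self.quanta = list(quanta)                    # per-level time quantum (ms)
        self.ms_per_input_token = ms_per_input_token  # assumed linear prefill cost
        self.queues = [deque() for _ in self.quanta]

    def initial_level(self, input_len):
        # Skip every queue whose quantum is too small for the first iteration.
        estimate = input_len * self.ms_per_input_token
        for level, quantum in enumerate(self.quanta):
            if quantum >= estimate:
                return level
        return len(self.quanta) - 1

    def admit(self, job):
        job.level = self.initial_level(job.input_len)
        self.queues[job.level].append(job)

    def pick(self):
        # Serve the first job of the highest-priority non-empty queue.
        for queue in self.queues:
            if queue:
                return queue.popleft()
        return None

    def requeue(self, job, service_ms):
        # Demote the job by one level once it has used up its quantum.
        job.used_in_level += service_ms
        last = len(self.quanta) - 1
        if job.used_in_level >= self.quanta[job.level] and job.level < last:
            job.level += 1
            job.used_in_level = 0.0
        self.queues[job.level].append(job)


sched = SkipJoinMLFQ(quanta=[10, 20, 40, 80], ms_per_input_token=0.1)
short = Job("short", input_len=50)    # estimated 5 ms, joins level 0
long_ = Job("long", input_len=300)    # estimated 30 ms, skips levels 0 and 1
sched.admit(long_)
sched.admit(short)
first = sched.pick()
```

Here the long prompt joins level 2 directly instead of entering level 0 and being demoted twice, so the short job is picked first even though it arrived later.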
If this is right
- More concurrent interactive sessions can be supported on the same GPU cluster without violating latency targets.
- Requests of widely varying input lengths experience less interference with one another.
- GPU memory is freed more promptly for new work, raising overall hardware utilization.
- Serving clusters can be sized smaller while still meeting service-level objectives.
Where Pith is reading between the lines
- The same token-level preemption idea could be applied to other autoregressive generation tasks such as code completion or image captioning.
- Integration with existing model-parallel frameworks would be needed to test whether the gains scale to models that do not fit on a single device.
- Production deployments would benefit from adding runtime feedback to adjust queue levels when input-length statistics drift.
Load-bearing premise
Token-level preemption and the skip-join MLFQ assignment based on input length incur low enough overhead to deliver the reported gains without hidden costs in real workloads.
What would settle it
A workload trace in which the measured preemption and offload overhead exceeds the latency reduction achieved by the scheduler would show that the throughput gains disappear.
Original abstract
Large language models (LLMs) power a new generation of interactive AI applications exemplified by ChatGPT. The interactive nature of these applications demands low latency for LLM inference. Existing LLM serving systems use run-to-completion processing for inference jobs, which suffers from head-of-line blocking and long latency. We present FastServe, a distributed inference serving system for LLMs. FastServe exploits the autoregressive pattern of LLM inference to enable preemption at the granularity of each output token. FastServe uses preemptive scheduling to minimize latency with a novel skip-join Multi-Level Feedback Queue scheduler. Based on the new semi-information-agnostic setting of LLM inference, the scheduler leverages the input length information to assign an appropriate initial queue for each arrival job to join. The higher priority queues than the joined queue are skipped to reduce demotions. We design an efficient GPU memory management mechanism that proactively offloads and uploads intermediate state between GPU memory and host memory for LLM inference. We build a system prototype of FastServe and experimental results show that compared to the state-of-the-art solution vLLM, FastServe improves the throughput by up to 31.4x and 17.9x under the same average and tail latency requirements, respectively.
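The difference between run-to-completion processing and preemption at each output token can be illustrated with a toy serving loop. This is a minimal sketch, not FastServe's code: `Request`, `decode_one_token`, and `serve` are hypothetical names, one loop step stands for one generated token, and a fixed integer priority stands in for whatever the real scheduler decides.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class Request:
    priority: int                            # lower value = served first
    name: str = field(compare=False)
    remaining: int = field(compare=False)    # output tokens still to generate
    generated: int = field(default=0, compare=False)


def decode_one_token(req):
    # Stand-in for one forward pass; a real system would run the model here.
    req.generated += 1
    req.remaining -= 1


def serve(arrivals):
    """arrivals maps a step index to the list of requests arriving at that step."""
    ready, finished, step = [], [], 0
    pending = sum(len(reqs) for reqs in arrivals.values())
    while pending or ready:
        for req in arrivals.get(step, []):
            heapq.heappush(ready, req)
            pending -= 1
        if ready:
            req = heapq.heappop(ready)        # re-decide after every token
            decode_one_token(req)
            if req.remaining == 0:
                finished.append((req.name, step + 1))
            else:
                heapq.heappush(ready, req)    # yield at the token boundary
        step += 1
    return finished


result = serve({
    0: [Request(priority=1, name="long", remaining=5)],
    2: [Request(priority=0, name="short", remaining=2)],
})
print(result)
```

In this toy run the short request arrives while the long one is mid-generation and completes at step 4. Under run-to-completion it would wait for the long request to finish and complete at step 7, which is the head-of-line blocking the abstract describes.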
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents FastServe, a distributed LLM inference serving system that exploits the autoregressive generation pattern to support preemption at token granularity. It introduces a skip-join Multi-Level Feedback Queue scheduler that assigns jobs to initial queues using input length information while skipping higher-priority queues to limit demotions, paired with a proactive GPU memory manager that offloads and restores KV-cache states to host memory. A prototype implementation is evaluated against vLLM, with the central empirical claim being throughput gains of up to 31.4× under equivalent average latency and 17.9× under tail latency constraints.
Significance. If the performance claims are substantiated with overhead measurements and reproducible workloads, the work would offer a practical advance for latency-sensitive LLM serving by showing how token-level preemption can reduce head-of-line blocking. The semi-information-agnostic scheduling heuristic represents a pragmatic engineering compromise that could be adopted in production systems.
major comments (2)
- [System Design] (preemption and memory management description): The central throughput claims rest on the assumption that token-granularity preemption plus repeated KV-cache offload/restore incurs negligible overhead. No quantitative bound is given on PCIe transfer time relative to per-token compute time, nor is the demotion frequency under realistic output-length distributions analyzed; without this, the 31.4×/17.9× gains cannot be confidently attributed to the scheduler rather than hidden costs.
- [Experimental evaluation] The reported speedups versus vLLM are presented without sufficient detail on workload traces, model sizes, hardware configuration, exact latency targets, or whether post-hoc tuning occurred. This information is load-bearing for assessing whether the gains generalize beyond the specific prototype runs.
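The kind of bound requested in the first major comment can be approximated with back-of-envelope arithmetic. The sketch below uses the standard KV-cache size for multi-head attention (one key and one value vector per layer per token). Every number in it is an illustrative assumption, not a measurement from the paper: the model shape (40 layers, hidden size 5120, fp16), the 512-token context, the 25 GB/s effective PCIe bandwidth, and the 30 ms per-token decode time. The function names are hypothetical.

```python
def kv_bytes_per_token(num_layers, hidden_size, bytes_per_elem):
    # One key vector and one value vector per layer, each of size hidden_size.
    return 2 * num_layers * hidden_size * bytes_per_elem


def transfer_ms(num_tokens, num_layers, hidden_size, bytes_per_elem, pcie_bytes_per_s):
    # Time to move the whole KV cache of one job across PCIe, in milliseconds.
    total_bytes = num_tokens * kv_bytes_per_token(num_layers, hidden_size, bytes_per_elem)
    return total_bytes / pcie_bytes_per_s * 1000.0


# Assumed configuration (hypothetical values for illustration only).
per_token = kv_bytes_per_token(num_layers=40, hidden_size=5120, bytes_per_elem=2)
offload = transfer_ms(
    num_tokens=512,
    num_layers=40,
    hidden_size=5120,
    bytes_per_elem=2,
    pcie_bytes_per_s=25e9,
)
assumed_decode_ms = 30.0  # assumed per-token compute time, not a measured value
ratio = offload / assumed_decode_ms

print(f"KV cache per token: {per_token} bytes")
print(f"Offload time for 512 tokens: {offload:.2f} ms")
print(f"Offload time / per-token decode time: {ratio:.2f}")
```

If the ratio is well below 1, a transfer can plausibly be hidden behind computation when issued ahead of time; if it approaches or exceeds 1, the offload cost sits on the critical path unless it is overlapped across several tokens. Measured values for the evaluated models and hardware would replace the assumed constants.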
minor comments (2)
- [Introduction] The phrase 'semi-information-agnostic setting' is used in the abstract and introduction without a precise definition or comparison to fully agnostic or fully aware baselines; adding a short clarifying paragraph would improve accessibility.
- [Evaluation] Figure captions and axis labels in the evaluation section should explicitly state the latency SLO values used for the throughput comparisons to allow direct interpretation of the 31.4× and 17.9× numbers.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below and have revised the manuscript to strengthen the presentation of overhead analysis and experimental details.
Point-by-point responses
- Referee: [System Design] (preemption and memory management description): The central throughput claims rest on the assumption that token-granularity preemption plus repeated KV-cache offload/restore incurs negligible overhead. No quantitative bound is given on PCIe transfer time relative to per-token compute time, nor is the demotion frequency under realistic output-length distributions analyzed; without this, the 31.4×/17.9× gains cannot be confidently attributed to the scheduler rather than hidden costs.
  Authors: We agree that explicit quantitative bounds on overheads would improve attribution of the reported gains. In the revised manuscript we have added a dedicated analysis subsection that measures PCIe transfer latency for KV-cache offload/restore operations relative to per-token generation time across the evaluated models and hardware. We also report demotion frequencies measured under output-length distributions drawn from public conversation traces, showing that the skip-join mechanism keeps demotions low. These new results confirm that the overhead remains small and that the throughput improvements are primarily due to reduced head-of-line blocking.
  Revision: yes
- Referee: [Experimental evaluation] The reported speedups versus vLLM are presented without sufficient detail on workload traces, model sizes, hardware configuration, exact latency targets, or whether post-hoc tuning occurred. This information is load-bearing for assessing whether the gains generalize beyond the specific prototype runs.
  Authors: We acknowledge that the original experimental section lacked sufficient detail for full reproducibility and generalization assessment. The revised manuscript expands the evaluation section to specify the exact workload traces (both synthetic and real traces from public sources), the model sizes and architectures tested, the precise hardware configuration (GPU models, memory, and interconnect), the concrete average and tail latency targets used to compute throughput, and an explicit statement that no post-hoc parameter tuning was applied beyond the design choices described in the paper. We have also made the evaluation configurations and scripts available as supplementary material.
  Revision: yes
Circularity Check
No significant circularity; empirical systems evaluation
Full rationale
The paper presents a systems design for LLM inference serving that relies on token-granularity preemption, proactive KV-cache offload, and a skip-join MLFQ scheduler whose initial queue assignment uses only input length. All load-bearing claims are throughput and latency improvements measured on a prototype implementation versus vLLM; no equations, fitted parameters, or first-principles derivations appear in the provided text. Consequently no step reduces by construction to its own inputs, self-citations, or ansatzes. The result is an independent empirical observation rather than a tautological renaming or prediction.
Forward citations
Cited by 18 Pith papers
- ReLibra: Routing-Replay-Guided Load Balancing for MoE Training in Reinforcement Learning
  ReLibra uses pre-known token-to-expert routing from RL rollouts to perform inter-batch expert reordering and intra-batch replication, delivering up to 1.6x higher throughput than Megatron-LM and 1.2x over oracle-equip...
- Taming Request Imbalance: SLO-Aware Scheduling for Disaggregated LLM Inference
  Kairos improves SLO attainment and throughput in LLM serving by adapting to request length imbalance with priority scheduling and adaptive batching.
- Autopoiesis: A Self-Evolving System Paradigm for LLM Serving Under Runtime Dynamics
  Autopoiesis uses LLM-driven program synthesis to evolve serving policies online during deployment, delivering up to 53% and average 34% gains over prior LLM serving systems under runtime dynamics.
- GhostServe: A Lightweight Checkpointing System in the Shadow for Fault-Tolerant LLM Serving
  GhostServe applies erasure coding to KV cache in host memory for fast recovery from failures in LLM serving, cutting checkpointing latency up to 2.7x and recovery latency 2.1x versus prior methods.
- Efficient Remote KV Cache Reuse with GPU-native Video Codec
  KVCodec uses GPU-native video codecs and pipelined fetching to compress and transmit KV caches, delivering up to 3.51x faster TTFT than prior methods while preserving accuracy.
- KV-RM: Regularizing KV-Cache Movement for Static-Graph LLM Serving
  KV-RM regularizes KV-cache movement in static-graph LLM serving via block paging and merge-staged transport to improve throughput, tail latency, and memory use for variable-length decoding.
- PRISM: Fast Online LLM Serving via Scheduling-Memory Co-design
  PRISM reduces P99 TTFT by 23.3-37.1% and raises exact-prefix KV-cache hit rates by 5.9-12.2 points versus the strongest baseline on 4B and 13B models by jointly optimizing scheduling and memory.
- Tackling the Data-Parallel Load Balancing Bottleneck in LLM Serving: Practical Online Routing at Scale
  BalanceRoute uses a piecewise-linear F-score (with optional short lookahead) for sticky request routing in LLM serving, reducing DP imbalance and raising end-to-end throughput versus vLLM baselines on production and A...
- Tackling the Data-Parallel Load Balancing Bottleneck in LLM Serving: Practical Online Routing at Scale
  BalanceRoute reduces data-parallel imbalance in LLM inference via F-score routing and lookahead, yielding higher end-to-end throughput on 144-NPU clusters versus vLLM baselines.
- A Queueing-Theoretic Framework for Stability Analysis of LLM Inference with KV Cache Memory Constraints
  A queueing model derives stability conditions for LLM inference services under combined compute and KV cache memory limits, with experimental validation showing typical deviations under 10%.
- EdgeFM: Efficient Edge Inference for Vision-Language Models
  EdgeFM is an agent-driven framework that strips non-essential features from VLMs and packages reusable optimized kernels, achieving up to 1.49x speedup over TensorRT-Edge-LLM on NVIDIA Orin while enabling first end-to...
- ODMA: On-Demand Memory Allocation Strategy for LLM Serving on LPDDR-Class Accelerators
  ODMA raises KV-cache utilization by up to 19.25% and throughput by 23-27% on Cambricon MLU accelerators by dynamically adjusting prediction buckets and using a safety pool for LLM serving.
- From Human Memory to AI Memory: A Survey on Memory Mechanisms in the Era of LLMs
  The paper surveys human memory categories, maps them to LLM memory, and proposes a new three-dimension (object, form, time) categorization into eight quadrants to organize existing work and highlight open problems.
- Token Economics for LLM Agents: A Dual-View Study from Computing and Economics
  The paper delivers a unified survey of token economics for LLM agents, conceptualizing tokens as production factors, exchange mediums, and units of account across micro, meso, macro, and security dimensions using esta...
- Network Edge Inference for Large Language Models: Principles, Techniques, and Opportunities
  A survey synthesizing challenges, system architectures, model optimizations, deployment methods, and resource management techniques for large language model inference at the network edge.
- A Survey on Efficient Inference for Large Language Models
  The paper surveys techniques to speed up and reduce the resource needs of LLM inference, organized by data-level, model-level, and system-level changes, with comparative experiments on representative methods.
- Cloud-native and Distributed Systems for Efficient and Scalable Large Language Models -- A Research Agenda
  This research agenda argues that cloud-native architectures, microservices, autoscaling, and emerging trends like serverless inference and federated learning are required to make large language models efficient and scalable.
- Towards Efficient Large Vision-Language Models: A Comprehensive Survey on Inference Strategies
  The paper surveys and taxonomizes inference optimization methods for large vision-language models across four categories while noting limitations and open problems.