Efficient Memory Management for Large Language Model Serving with PagedAttention
Pith reviewed 2026-05-12 14:58 UTC · model grok-4.3
The pith
PagedAttention manages LLM key-value caches like operating-system virtual memory to eliminate fragmentation and allow sharing.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
PagedAttention is an attention algorithm that stores the key-value cache in non-contiguous blocks managed like virtual-memory pages; on top of it, vLLM achieves near-zero KV-cache waste and flexible intra- and inter-request sharing, delivering 2-4x higher throughput than FasterTransformer or Orca at the same latency.
What carries the argument
PagedAttention, the algorithm that divides the key-value cache into fixed-size blocks (pages) that can be allocated, swapped, and shared independently of contiguous memory layout.
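The block layout described above can be illustrated with a small sketch. This is not vLLM's code: `BlockAllocator`, `Sequence`, `append_token`, and `locate` are hypothetical names, and the free-list policy is an assumption made for illustration. It shows only how a per-request block table maps a token position to a (physical block, offset) pair, so that a request's cache need not be contiguous.

```python
from dataclasses import dataclass, field


class BlockAllocator:
    """Hands out fixed-size physical blocks from a free list (hypothetical)."""

    def __init__(self, num_blocks: int) -> None:
        # Reversed so that pop() yields block IDs 0, 1, 2, ...
        self.free = list(range(num_blocks - 1, -1, -1))

    def allocate(self) -> int:
        if not self.free:
            raise MemoryError("no free KV blocks")
        return self.free.pop()

    def release(self, block_id: int) -> None:
        self.free.append(block_id)


@dataclass
class Sequence:
    """Per-request state: a block table mapping logical to physical blocks."""

    block_size: int
    num_tokens: int = 0
    block_table: list = field(default_factory=list)

    def append_token(self, allocator: BlockAllocator) -> tuple:
        """Reserve one token slot, allocating a block only when the last is full."""
        if self.num_tokens == len(self.block_table) * self.block_size:
            self.block_table.append(allocator.allocate())
        self.num_tokens += 1
        return self.locate(self.num_tokens - 1)

    def locate(self, position: int) -> tuple:
        """Translate a token position into (physical block ID, offset in block)."""
        if not 0 <= position < self.num_tokens:
            raise IndexError(position)
        logical_block, offset = divmod(position, self.block_size)
        return self.block_table[logical_block], offset


# Two requests grow in lockstep, so their blocks interleave in physical memory.
allocator = BlockAllocator(num_blocks=8)
a = Sequence(block_size=4)
b = Sequence(block_size=4)
for _ in range(5):
    a.append_token(allocator)
    b.append_token(allocator)
```

After the loop, each request holds two blocks that are not adjacent in the physical pool, yet `locate` resolves any token position with one division and one table lookup.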
If this is right
- Batch sizes can grow within a fixed memory budget because less of the KV cache is wasted, directly raising tokens processed per second.
- Long-context and beam-search workloads become practical on the same hardware because memory is no longer the dominant limit.
- Sharing of cache blocks across requests reduces total memory footprint when prompts overlap.
- Memory usage becomes more predictable, simplifying capacity planning for production serving clusters.
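The sharing point in the list above can be sketched with reference-counted blocks and copy-on-write. The text here does not spell out the mechanism, so treat this as one plausible scheme rather than the paper's implementation; `SharedBlockPool` and all of its methods are hypothetical, and the actual copying of block contents is omitted.

```python
class SharedBlockPool:
    """Physical KV blocks with reference counts (all names hypothetical)."""

    def __init__(self, num_blocks: int) -> None:
        self.free = list(range(num_blocks - 1, -1, -1))
        self.ref_count = {}

    def allocate(self) -> int:
        if not self.free:
            raise MemoryError("no free KV blocks")
        block_id = self.free.pop()
        self.ref_count[block_id] = 1
        return block_id

    def share(self, block_id: int) -> int:
        """Let another request point at an existing block."""
        self.ref_count[block_id] += 1
        return block_id

    def release(self, block_id: int) -> None:
        self.ref_count[block_id] -= 1
        if self.ref_count[block_id] == 0:
            del self.ref_count[block_id]
            self.free.append(block_id)

    def writable(self, block_id: int) -> int:
        """Copy-on-write: return a block that the caller alone may modify."""
        if self.ref_count[block_id] == 1:
            return block_id
        self.release(block_id)   # drop this request's reference to the shared block
        return self.allocate()   # the caller would copy the old contents here

    def blocks_in_use(self) -> int:
        return len(self.ref_count)


pool = SharedBlockPool(num_blocks=16)
prompt_blocks = [pool.allocate() for _ in range(3)]

# Two requests with the same prompt point at the same physical blocks.
request_a = list(prompt_blocks)
request_b = [pool.share(block) for block in prompt_blocks]
shared_usage = pool.blocks_in_use()

# Request B appends into the last, partially filled block and gets a private copy.
request_b[-1] = pool.writable(request_b[-1])
usage_after_write = pool.blocks_in_use()
```

In this toy run the two requests occupy three blocks while they only read the prompt and four once one of them writes, compared with six if each held its own copy.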
Where Pith is reading between the lines
- The paging abstraction could be reused for other dynamic tensor structures that grow during inference.
- Hardware accelerators might add native support for paged attention to remove the remaining software mapping cost.
- Because the code is open, other serving frameworks can adopt the same block layout without reimplementing the attention kernel.
Load-bearing premise
That translating the key-value cache into paged blocks adds negligible cost to attention arithmetic and produces identical model outputs on every workload.
What would settle it
Measure KV-cache memory utilization and exact token outputs on a benchmark with highly variable sequence lengths; if wasted KV-cache memory stays well above zero, or any output token differs from a non-paged baseline, the central claim does not hold.
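A minimal sketch of the two measurements in this test, under stated assumptions: the helper names are hypothetical, the sequence lengths are made up, and the non-paged baseline is assumed to reserve `max_len` contiguous slots per request. Real measurements would read allocator state and decoded token IDs from the serving system itself.

```python
import math


def paged_waste_fraction(seq_lens, block_size):
    """Fraction of allocated token slots left unused under block allocation."""
    allocated = sum(math.ceil(n / block_size) * block_size for n in seq_lens)
    return (allocated - sum(seq_lens)) / allocated


def preallocated_waste_fraction(seq_lens, max_len):
    """Same metric when every request reserves max_len contiguous slots."""
    allocated = max_len * len(seq_lens)
    return (allocated - sum(seq_lens)) / allocated


def first_divergence(baseline_tokens, candidate_tokens):
    """Index of the first differing token, or None if the outputs are identical."""
    for i, (expected, actual) in enumerate(zip(baseline_tokens, candidate_tokens)):
        if expected != actual:
            return i
    if len(baseline_tokens) != len(candidate_tokens):
        return min(len(baseline_tokens), len(candidate_tokens))
    return None


seq_lens = [17, 250, 1024, 33, 600]  # made-up, highly variable lengths
paged = paged_waste_fraction(seq_lens, block_size=16)
reserved = preallocated_waste_fraction(seq_lens, max_len=2048)
```

With block allocation, each request can waste at most `block_size - 1` slots in its last block, so `paged` stays small regardless of how lengths vary; `reserved` depends on how far the lengths fall short of `max_len`. Both functions assume at least one request with a positive length.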
Original abstract
High throughput serving of large language models (LLMs) requires batching sufficiently many requests at a time. However, existing systems struggle because the key-value cache (KV cache) memory for each request is huge and grows and shrinks dynamically. When managed inefficiently, this memory can be significantly wasted by fragmentation and redundant duplication, limiting the batch size. To address this problem, we propose PagedAttention, an attention algorithm inspired by the classical virtual memory and paging techniques in operating systems. On top of it, we build vLLM, an LLM serving system that achieves (1) near-zero waste in KV cache memory and (2) flexible sharing of KV cache within and across requests to further reduce memory usage. Our evaluations show that vLLM improves the throughput of popular LLMs by 2-4× with the same level of latency compared to the state-of-the-art systems, such as FasterTransformer and Orca. The improvement is more pronounced with longer sequences, larger models, and more complex decoding algorithms. vLLM's source code is publicly available at https://github.com/vllm-project/vllm
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes PagedAttention, an attention algorithm modeled on OS paging to manage the dynamic, per-request key-value cache during LLM inference. By organizing the KV cache into fixed-size blocks with a page table, vLLM achieves near-zero fragmentation and enables KV cache sharing within and across requests. Empirical results on standard models claim 2-4× higher throughput than FasterTransformer and Orca at equivalent latency, with larger gains for long sequences, bigger models, and complex decoding.
Significance. If the throughput claims hold under the reported conditions, the work is significant for production LLM serving: it directly attacks the memory-fragmentation bottleneck that limits batch size, potentially lowering inference cost and latency for long-context workloads. Public code release aids reproducibility and adoption.
major comments (1)
- [§5] §5 (Evaluation): the 2-4× throughput numbers are obtained by comparing against external systems; the manuscript does not isolate the incremental latency cost of PagedAttention’s page-table indirection and non-contiguous loads inside the fused attention kernels (e.g., via a same-batch-size contiguous-cache baseline). Without this measurement it remains possible that the reported gains are partly offset by reduced GPU utilization at the larger batch sizes enabled by reduced fragmentation.
minor comments (2)
- [Abstract, §5] Abstract and §5: benchmark configurations (sequence lengths, batch sizes, hardware, exact model variants) are summarized but not tabulated; adding a concise table would improve clarity.
- [§3] §3: the description of block allocation and page-table lookup is clear at a high level but does not specify the exact data structures or cache-line effects inside the CUDA kernels; a short pseudocode listing would help readers replicate the implementation.
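As a rough illustration of the kind of pseudocode the §3 comment asks for, the sketch below computes single-query, single-head attention while reading keys and values through a block table. It is a pure-Python stand-in, not the CUDA kernel: the function names and cache layout are hypothetical, and the running-max accumulation (online softmax) is an assumption about how a block-at-a-time loop could avoid materializing a contiguous copy.

```python
import math


def attention_contiguous(query, keys, values):
    """Reference: softmax(q.k / sqrt(d)) weighted sum over a contiguous cache."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total for i in range(dim)]


def attention_paged(query, key_blocks, value_blocks, block_table, context_len, block_size):
    """Same computation, reading K/V through a block table one block at a time."""
    scale = 1.0 / math.sqrt(len(query))
    running_max = -math.inf
    running_sum = 0.0
    acc = [0.0] * len(value_blocks[block_table[0]][0])
    for logical, physical in enumerate(block_table):
        # The last block may be only partially filled.
        filled = min(block_size, context_len - logical * block_size)
        for slot in range(filled):
            key = key_blocks[physical][slot]
            value = value_blocks[physical][slot]
            score = scale * sum(q * k for q, k in zip(query, key))
            new_max = max(running_max, score)
            correction = math.exp(running_max - new_max)  # rescale earlier terms
            weight = math.exp(score - new_max)
            running_sum = running_sum * correction + weight
            acc = [a * correction + weight * v for a, v in zip(acc, value)]
            running_max = new_max
    return [a / running_sum for a in acc]


# Scatter a 5-token cache into non-adjacent physical blocks 5, 2, and 7.
block_size = 2
keys = [[0.1 * (t + 1), -0.2 * t] for t in range(5)]
values = [[float(t), 1.0] for t in range(5)]
block_table = [5, 2, 7]
key_blocks = [[None] * block_size for _ in range(8)]
value_blocks = [[None] * block_size for _ in range(8)]
for t in range(5):
    physical = block_table[t // block_size]
    key_blocks[physical][t % block_size] = keys[t]
    value_blocks[physical][t % block_size] = values[t]

query = [0.3, -0.7]
reference = attention_contiguous(query, keys, values)
paged = attention_paged(query, key_blocks, value_blocks, block_table, 5, block_size)
```

The two results agree to floating-point tolerance. Because the accumulation order differs, they are not guaranteed to match bit for bit, which is worth keeping in mind when comparing a paged kernel against a non-paged baseline.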
Simulated Author's Rebuttal
We thank the referee for the positive summary and recommendation for minor revision. We address the single major comment below.
- Referee: [§5] §5 (Evaluation): the 2-4× throughput numbers are obtained by comparing against external systems; the manuscript does not isolate the incremental latency cost of PagedAttention’s page-table indirection and non-contiguous loads inside the fused attention kernels (e.g., via a same-batch-size contiguous-cache baseline). Without this measurement it remains possible that the reported gains are partly offset by reduced GPU utilization at the larger batch sizes enabled by reduced fragmentation.
  Authors: We agree that an intra-system ablation isolating the page-table and non-contiguous access overhead would strengthen the evaluation. Our end-to-end results compare against external baselines because that is the relevant metric for practitioners; the reported throughput gains arise primarily from the larger batch sizes made possible by near-zero fragmentation. Nevertheless, the concern is valid: any kernel-level slowdown could partially offset those gains at scale. We will revise §5 to include a same-batch-size contiguous-cache baseline inside vLLM (by temporarily disabling the page table and forcing contiguous allocation) and report the resulting latency difference for the attention kernels. This addition will quantify the incremental cost directly.
  Revision: yes
Circularity Check
No significant circularity in PagedAttention derivation or claims
The paper proposes PagedAttention as an OS-paging-inspired algorithm for KV-cache management, implements it in vLLM, and supports its 2-4x throughput claims via direct empirical benchmarks against independent external systems (FasterTransformer, Orca). No mathematical derivations, fitted parameters presented as predictions, self-definitional equations, or load-bearing self-citations appear in the abstract or described chain. All performance results are externally falsifiable measurements rather than reductions to internal inputs.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: GPU memory allocation can be performed at fine granularity with low overhead for attention operations.
invented entities (1)
- PagedAttention: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith.Foundation.LedgerForcing · conservation_from_balance
  Relation between the paper passage and the cited Recognition theorem: unclear
  Passage: "we propose PagedAttention, an attention algorithm inspired by the classical virtual memory and paging techniques in operating systems. On top of it, we build vLLM, an LLM serving system that achieves (1) near-zero waste in KV cache memory and (2) flexible sharing of KV cache within and across requests"
- IndisputableMonolith.Foundation.DimensionForcing · dimension_forced
  Relation between the paper passage and the cited Recognition theorem: unclear
  Passage: "vLLM improves the throughput of popular LLMs by 2-4× with the same level of latency compared to the state-of-the-art systems"
- IndisputableMonolith.Foundation.HierarchyEmergence · hierarchy_emergence_forces_phi
  Relation between the paper passage and the cited Recognition theorem: unclear
  Passage: "the KV cache memory for each request is huge and grows and shrinks dynamically"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 37 Pith papers
- MeMo: Memory as a Model
  MeMo encodes new knowledge into a separate memory model for frozen LLMs, achieving strong performance on BrowseComp-Plus, NarrativeQA, and MuSiQue while capturing cross-document relationships and remaining robust to r...
- Generative Floor Plan Design with LLMs via Reinforcement Learning with Verifiable Rewards
  Fine-tuned LLMs trained with reinforcement learning using verifiable rewards produce floor plans that satisfy connectivity and numerical constraints, outperforming prior methods with at least 94% relative improvement ...
- NCCLZ: Compression-Enabled GPU Collectives with Decoupled Quantization and Entropy Coding
  NCCLZ decouples quantization and entropy coding across NCCL stack layers to enable overlapped compression, delivering up to 9.65x speedup over plain NCCL on scientific and training workloads.
- The Illusion of Power Capping in LLM Decode: A Phase-Aware Energy Characterisation Across Attention Architectures
  Power capping is illusory in LLM decode as memory-bound operation leaves power headroom untouched on 700 W GPUs, while SM clock locking saves up to 32% energy and three DVFS classes appear across attention types.
- Surviving Partial Rank Failures in Wide Expert-Parallel MoE Inference
  EEP makes wide expert-parallel MoE serving survive single-rank failures with an 11s recovery pause, 8s reintegration pause, and throughput restored to 95% of pre-fault level within 52s while staying within 4.4% of a f...
- Non-Monotonic Latency in Apple MPS Decoding: KV Cache Interactions and Execution Regimes
  Apple MPS decoding exhibits non-monotonic latency with spikes up to 21x due to KV cache interactions and execution regimes, unlike monotonic behavior on CPU and CUDA.
- Non-Monotonic Latency in Apple MPS Decoding: KV Cache Interactions and Execution Regimes
  Apple MPS transformer decoding shows abrupt latency spikes up to 21x in narrow decoding-budget intervals due to KV cache and execution regime shifts, absent on CPU and CUDA.
- DUET: Optimize Token-Budget Allocation for Reinforcement Learning with Verifiable Rewards
  DUET improves RLVR by allocating tokens across both prompt selection and rollout length, outperforming full-budget baselines even when using only half the tokens.
- KL for a KL: On-Policy Distillation with Control Variate Baseline
  vOPD stabilizes on-policy distillation gradients by subtracting a closed-form per-token negative reverse KL baseline as a detached control variate, preserving unbiasedness while lowering variance and matching expensiv...
- CacheFlow: Efficient LLM Serving with 3D-Parallel KV Cache Restoration
  CacheFlow cuts TTFT by 10-62% in batched LLM serving via 3D-parallel KV cache restoration and a two-pointer scheduler that overlaps recompute and I/O.
- PermaFrost-Attack: Stealth Pretraining Seeding(SPS) for planting Logic Landmines During LLM Training
  Stealth Pretraining Seeding plants persistent unsafe behaviors in LLMs via diffuse poisoned web content that activates on precise triggers and evades standard evaluation.
- Neural Garbage Collection: Learning to Forget while Learning to Reason
  Language models learn to evict KV cache entries end-to-end via reinforcement learning from outcome reward alone, achieving 2-3x cache compression while maintaining accuracy on Countdown, AMC, and AIME tasks.
- Sparse Prefix Caching for Hybrid and Recurrent LLM Serving
  Sparse prefix caching via dynamic programming for optimal checkpoint placement under overlap distributions improves the Pareto frontier for recurrent and hybrid LLM serving on shared-prefix data.
- LoSA: Locality Aware Sparse Attention for Block-Wise Diffusion Language Models
  LoSA caches prefix attention for stable tokens in block-wise DLMs and applies sparse attention only to active tokens, preserving near-dense accuracy while achieving 1.54x lower attention density and up to 4.14x speedup.
- LLM4Log: A Systematic Review of Large Language Model-based Log Analysis
  LLM4Log is a systematic review of 145 papers on LLM-based log analysis that delivers a unified taxonomy, design patterns, and open challenges for reliable adoption in AIOps.
- Internalizing Curriculum Judgment for LLM Reinforcement Fine-Tuning
  METIS internalizes curriculum judgment in LLM reinforcement fine-tuning by predicting within-prompt reward variance via in-context learning and jointly optimizing with a self-judgment reward, yielding superior perform...
- Slipstream: Trajectory-Grounded Compaction Validation for Long-Horizon Agents
  Slipstream uses asynchronous compaction with trajectory-grounded judge validation to improve long-horizon agent accuracy by up to 8.8 percentage points and reduce latency by up to 39.7%.
- Long-Context Aware Upcycling: A New Frontier for Hybrid LLM Scaling
  HyLo upcycles Transformer LLMs into hybrids with MLA and Mamba2/Gated DeltaNet blocks via staged training and distillation, extending context to 2M tokens and outperforming prior upcycled hybrids on long-context benchmarks.
- Open-TQ-Metal: Fused Compressed-Domain Attention for Long-Context LLM Inference on Apple Silicon
  Fused compressed-domain int4 attention on Apple Silicon delivers 48x speedup and 3.2x KV cache compression for 128K-context 70B models while matching FP16 token predictions.
- Record-Remix-Replay: Hierarchical GPU Kernel Optimization using Evolutionary Search
  R^3 optimizes full scientific applications on GPUs better than tuning kernel parameters or compiler flags alone while running nearly an order of magnitude faster than modern evolutionary search methods.
- Reduced-Mass Orbital AI Inference via Integrated Solar, Compute, and Radiator Panels
  Integrated solar-compute-radiator panels enable orbital satellites to achieve over 100 kW of AI inference compute per metric ton launched, supporting thousands of simultaneous large language model sessions.
- MemFactory: Unified Inference & Training Framework for Agent Memory
  MemFactory is a new unified modular framework for memory-augmented LLM agent inference and training that integrates GRPO and reports up to 14.8% relative gains on MemAgent evaluations.
- Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection
  Self-RAG trains LLMs to adaptively retrieve passages on demand and self-critique using reflection tokens, outperforming ChatGPT and retrieval-augmented Llama2 on QA, reasoning, and fact verification.
- Position: LLM Inference Should Be Evaluated as Energy-to-Token Production
  LLM inference should be reframed and evaluated as energy-to-token production with a Token Production Function that accounts for power, cooling, and efficiency ceilings.
- An Executable Benchmarking Suite for Tool-Using Agents
  The paper delivers a unified executable benchmarking suite for tool-using agents that enforces a shared evidence-admission contract across web, code, and micro-task environments.
- How Does Chunking Affect Retrieval-Augmented Code Completion? A Controlled Empirical Study
  Function-based chunking underperforms other strategies in RAG code completion by 3.57-5.64 points, with context length as the dominant factor.
- VLMaxxing through FrameMogging Training-Free Anti-Recomputation for Video Vision-Language Models
  Training-free adaptive reuse of stable visual state in video VLMs reduces follow-up latency by 15-36x on Qwen2.5-VL while preserving correctness on VideoMME, with smaller first-query speedups via pruning.
- StreamIndex: Memory-Bounded Compressed Sparse Attention via Streaming Top-k
  Chunked streaming top-k enables CSA indexer execution at 1M sequence length with 6.21 GB peak memory and >=0.998 recall on synthetic V4-shaped inputs.
- EdgeFM: Efficient Edge Inference for Vision-Language Models
  EdgeFM is an agent-driven framework that strips non-essential features from VLMs and packages reusable optimized kernels, achieving up to 1.49x speedup over TensorRT-Edge-LLM on NVIDIA Orin while enabling first end-to...
- Measurement of Generative AI Workload Power Profiles for Whole-Facility Data Center Infrastructure Planning
  High-resolution power profiles for AI workloads on H100 GPUs are measured and scaled to whole-facility energy demand using a bottom-up model, with the dataset made public.
- Silicon Showdown: Performance, Efficiency, and Ecosystem Barriers in Consumer-Grade LLM Inference
  Nvidia achieves 1.6x throughput with NVFP4 but hits a VRAM wall for 70B+ models, while Apple UMA enables linear scaling to 80B at 4-bit with up to 23x better energy efficiency.
- EasyVideoR1: Easier RL for Video Understanding
  EasyVideoR1 delivers an optimized RL pipeline for video understanding in large vision-language models, achieving 1.47x throughput gains and aligned results on 22 benchmarks.
- Hierarchical vs. Flat Iteration in Shared-Weight Transformers
  Hierarchical two-speed shared-weight recurrence in Transformers shows a sharp performance gap compared to independent layer stacking in empirical language modeling tests.
- Secure On-Premise Deployment of Open-Weights Large Language Models in Radiology: An Isolation-First Architecture with Prospective Pilot Evaluation
  An isolation-first on-premise architecture for open-weights LLMs in radiology achieved regulatory approval for processing PHI and showed good utility for text-anchored tasks in a one-week pilot with 22 users.
- Seed1.5-VL Technical Report
  Seed1.5-VL is a compact multimodal model that sets new records on dozens of vision-language benchmarks and outperforms prior systems on agent-style tasks.
- Yi: Open Foundation Models by 01.AI
  Yi models are 6B and 34B open foundation models pretrained on 3.1T curated tokens that achieve strong benchmark results through data quality and targeted extensions like long context and vision alignment.
- SLM Finetuning for Natural Language to Domain Specific Code Generation in Production
  Fine-tuned small language models outperform larger models in natural language to domain-specific code generation with improved performance, latency, and the ability to adapt to customer-specific scenarios without losi...