Cache-Resident LLM Inference in GB-Scale Last-Level Caches

Ceyu Xu; Jian Weng; Marco Canini; Tongzhou Gu; Wanning Zhang

arxiv: 2606.25353 · v1 · pith:VGRENWABnew · submitted 2026-06-24 · 💻 cs.AR

Cache-Resident LLM Inference in GB-Scale Last-Level Caches

Wanning Zhang , Tongzhou Gu , Marco Canini , Ceyu Xu , Jian Weng This is my paper

Pith reviewed 2026-06-25 20:14 UTC · model grok-4.3

classification 💻 cs.AR

keywords LLM inferencelast-level cachecache-resident executionKV cacheCPU clusterdata movementinference optimizationweight residency

0 comments

The pith

A cache-resident execution model keeps LLM weights in GB-scale CPU caches by separating weight operators from attention, delivering up to 13.9x faster token generation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to establish that LLM inference can avoid repeated memory fetches by organizing computation so that reusable weights stay inside large on-chip last-level caches. It achieves this through a model that puts weight-centric operators in dedicated domains separate from attention and KV-cache work, while relaxing synchronization to sub-operator dependencies. A reader would care because the approach shows commodity server CPUs could handle inference workloads efficiently without specialized accelerators or extreme memory bandwidth. The prototype reports 2x to 11x speedups on real Llama models and modeled gains up to nearly 14x under varied conditions.

Core claim

The paper claims that a cache-resident execution model for hierarchical-memory clustered systems, which separates weight-centric operators from attention and KV-cache management into dedicated resource domains while relaxing synchronization from operator boundaries to true sub-operator dependencies, keeps reusable weights cache-resident and scales KV capacity independently, enabling measured 2.04x-11.51x TPOT speedups on Llama-3.2-3B and Llama-2-7B over equally provisioned llama.cpp and up to 13.9x under a validated analytical model.

What carries the argument

The weight-attention decoupled architecture that places weight operators in separate resource domains from attention and KV management while using locality-aware placement and a specialized static runtime.

If this is right

Time-per-output-token improves 2.04x-11.51x on deployed Llama-3.2-3B and Llama-2-7B configurations.
An analytical model projects up to 13.9x TPOT gains across model sizes, context lengths, and batch sizes.
Commodity CPUs with GB-scale last-level caches support efficient LLM inference when organized around cache residency and decoupled state management.
Relaxed synchronization reduces coordination overhead once operators become cache-resident.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The decoupling pattern could be applied to other memory-bound applications on clustered CPUs to reduce data movement.
Future CPUs with even larger caches might achieve higher speedups if the resource-domain separation is preserved.
Production inference systems may need to adopt similar static runtimes to exploit cache residency on multi-socket hardware.

Load-bearing premise

Separating weight-centric operators into dedicated resource domains while scaling KV capacity independently will keep weights cache-resident without the increased in-flight requests and KV footprint negating the latency and bandwidth benefits.

What would settle it

Running the prototype on longer contexts or larger batches where the KV footprint forces weight evictions from the LLC and measuring whether TPOT speedup falls below 1x relative to llama.cpp.

Figures

Figures reproduced from arXiv: 2606.25353 by Ceyu Xu, Jian Weng, Marco Canini, Tongzhou Gu, Wanning Zhang.

**Figure 1.** Figure 1: Transformer-based LLM decodes each token by propagating the embeddings across operators. achieves 2.04×–11.51× TPOT speedup. Under model-based extrapolation, it further reaches up to 12.5× throughput speedup and up to 13.9× TPOT speedup across model sizes, context lengths, and batch sizes of Llama-family models. 2 Background & Motivation: LLM Inference In this section, we first characterize transformer-bas… view at source ↗

**Figure 2.** Figure 2: FLOPs/Byte of Llama-2 3B & 7B during decoding with context length 4096. it is the long-running steady state and dominates execution time [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗

**Figure 4.** Figure 4: Tensor and pipeline parallelism (TP & PP) scaling 2.3 Performance Scaling We analyze performance scaling [20, 30] on the clustered system above using two widely adopted techniques, tensor parallelism (TP) and pipeline parallelism (PP), as illustrated in [PITH_FULL_IMAGE:figures/full_fig_p004_4.png] view at source ↗

**Figure 5.** Figure 5: Comparison between weight–KV colocation and weight–attention decoupling As illustrated in [PITH_FULL_IMAGE:figures/full_fig_p006_5.png] view at source ↗

**Figure 6.** Figure 6: Sub-operator coordination CPU2). We treat each CPU socket as a node in the architectural sense defined in §2.2. As discussed in §3.1, The execution model partitions each transformer layer into two sequential stages: (i) weight-centric operators (QKV projection and FFN), and (ii) attention-centric operators. CPU1 is designated as the weight node, and executes weight-centric operators, while CPU2 is desig… view at source ↗

**Figure 7.** Figure 7: Prototyping Hardware Platform To improve locality further, weight shards are partitioned and initialized in a bank-aware manner so that each core predominantly accesses LLC slices that are physically close to it. This placement is established during cache warmup at system initialization— when each core first touches its assigned weight shard, that first touch determines the shard’s home cache bank, which … view at source ↗

**Figure 8.** Figure 8: Model-based throughput and TPOT speedups over llama.cpp across context lengths and batch sizes for four models. The speedup is on top of each cell, and the corresponding absolute throughput (tokens/s) or TPOT (ms) is at the bottom. and tail latency in user-facing serving systems. This makes the small-batch regime (1–8) the practically relevant operating point for latency-sensitive serving, and [PITH_FUL… view at source ↗

**Figure 9.** Figure 9: Effect of weight-attention separation on per-transformer-block latency in Llama-family models. The speedup is on top of each cell, and the absolute time saving per transformer block (𝜇s) is at the bottom. 1K 2K 4K 1.47x -37.0us 1.32x -38.5us 1.16x -39.0us 1.11x -41.1us 1.05x -31.1us 1.06x -64.3us 1.56x -53.2us 1.26x -40.0us 1.12x -41.2us 1.08x -46.8us 1.07x -69.4us 1.02x -41.6us 1.33x -44.3us 1.18x -50.0us… view at source ↗

**Figure 10.** Figure 10: Customized thread pool versus OpenMP in a single-socket, per-transformer-block comparison. The speedup is on top of each cell, and the absolute time saving per transformer block (𝜇s) is at the bottom. support for the underlying cache-pressure hypothesis, but with moderate and workload-dependent gains rather than a universal large win. 7 Discussion 7.1 Related Works Dedicated Resource Allocations: Prior s… view at source ↗

**Figure 11.** Figure 11: Llama-2-70B at ctx=4096 weight attention separation breakdown. The left bars are single-socket (=1), and the right bars are weight-attention separated. match distinct compute and memory characteristics on GPUs. In contrast, this work approaches this issue through weight– attention separation to specialize for the different compute and memory patterns. KV-cache Management: Recent systems optimize inferen… view at source ↗

read the original abstract

Large language model (LLM) inference is increasingly dominated by data movement across the memory hierarchy. Recent 3D-stacked cache technologies have enabled GB-scale last-level caches in modern server CPUs, making it possible to keep reusable model weights on chip and exploit cache bandwidth and latency. Achieving this regime is not straightforward: deeper pipelining for weight residency increases in-flight requests and KV-cache footprint, while cache-resident operators make operator-boundary synchronization a visible bottleneck. We present a cache-resident execution model for inference on hierarchical-memory clustered systems. The model separates weight-centric operators from attention and KV-cache management into dedicated resource domains, keeping reusable weights cache-resident while scaling KV capacity independently of pipeline depth. It also relaxes synchronization from operator boundaries to true sub-operator dependencies, reducing coordination overhead in the cache-resident regime. We instantiate this model on a multi-socket CPU cluster with a weight-attention decoupled architecture, locality-aware placement, and a specialized static runtime. The prototype substantially outperforms equally provisioned llama.cpp. On deployed Llama-3.2-3B and Llama-2-7B configurations, it achieves 2.04x-11.51x speedup on time-per-output-token (TPOT). Under a validated analytical model, it further reaches up to 13.9x TPOT speedup across model sizes, context lengths, and batch sizes. These results show that commodity CPUs with GB-scale last-level caches can support efficient LLM inference when execution is organized around cache residency, decoupled state management, and dependency-aware coordination.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper gets 2-11x TPOT gains over llama.cpp on small Llamas by decoupling weights from attention/KV and relaxing sync, but the gains rest on weights staying resident as KV and in-flight work grow.

read the letter

The main thing to know is that this work organizes LLM inference so weights stay in GB-scale LLCs on CPUs. They split weight-centric operators into their own domains, scale KV independently, and relax synchronization to sub-operator dependencies instead of full operator boundaries.

What is new is that specific combination for the cache-resident regime, plus the locality-aware placement and static runtime on a multi-socket cluster. The prototype delivers measured 2.04x-11.51x TPOT speedups on Llama-3.2-3B and Llama-2-7B versus equally provisioned llama.cpp, with an analytical model projecting up to 13.9x across more sizes and batches.

The paper does well by shipping concrete hardware numbers rather than just modeling. The comparison baseline is fair and the execution model directly targets the pipelining tension the abstract describes.

The soft spot is the load-bearing assumption that the decoupling keeps weights cache-resident. Deeper pipelining increases in-flight requests and KV footprint; if either forces weight eviction, the latency and bandwidth edge disappears. The reported results are for configurations where residency holds, but without explicit data on weight miss rates or LLC occupancy versus pipeline depth or batch size, it is unclear how far the gains generalize.

This is for people working on CPU inference runtimes or cache-aware systems. A reader focused on non-GPU options for small-to-medium models will find usable ideas and numbers.

It deserves peer review. The measured results and clear technical approach are enough to justify referee time, even if the residency limits need more scrutiny.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes a cache-resident execution model for LLM inference on CPUs featuring GB-scale last-level caches. By decoupling weight-centric operators into dedicated resource domains from attention and KV-cache management, the approach aims to maintain model weights in cache while independently scaling KV capacity. A prototype implementation on a multi-socket CPU cluster with locality-aware placement and static runtime demonstrates 2.04x to 11.51x speedups in time-per-output-token (TPOT) over llama.cpp for Llama-3.2-3B and Llama-2-7B models. An analytical model extends this to up to 13.9x speedup across model sizes, context lengths, and batch sizes.

Significance. If the central claims hold, the work shows that commodity server CPUs can perform efficient LLM inference by leveraging large on-chip caches through careful execution organization, decoupling of state, and relaxed synchronization. This provides an alternative to GPU-centric approaches for certain inference workloads and highlights the potential of 3D-stacked cache technologies. The empirical speedups on real hardware are a strength.

major comments (2)

[Abstract] Abstract: the claim of a 'validated analytical model' supporting up to 13.9x TPOT speedup provides no equations, validation methodology, or data. This makes it impossible to determine whether the predictions are independent of fitted parameters or reduce to the input data, which is load-bearing for the extended performance claims beyond the prototype measurements.
[Abstract] Abstract (execution model): the model assumes that separating weight-centric operators into dedicated domains while scaling KV capacity independently will keep weights cache-resident despite increased in-flight requests and KV footprint from deeper pipelining. No measurements of weight miss rates, actual LLC occupancy, or KV size versus pipeline depth are referenced, which is critical to showing that the 2.04x–11.51x speedups generalize rather than being artifacts of limited configurations.

minor comments (1)

[Abstract] Abstract: the reported speedup range 2.04x-11.51x is stated without mapping specific factors to model/context/batch configurations or including error bars.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which identify opportunities to strengthen the presentation of our analytical model and supporting measurements. We address each point below and will revise the manuscript accordingly.

read point-by-point responses

Referee: [Abstract] Abstract: the claim of a 'validated analytical model' supporting up to 13.9x TPOT speedup provides no equations, validation methodology, or data. This makes it impossible to determine whether the predictions are independent of fitted parameters or reduce to the input data, which is load-bearing for the extended performance claims beyond the prototype measurements.

Authors: We agree the abstract is too terse on this point. Section 5 of the manuscript derives the analytical TPOT model from first principles using measured cache bandwidth, pipeline depth, and KV-cache scaling factors, with parameters taken directly from hardware datasheets rather than fitted to results. Validation compares model predictions to the 2.04x–11.51x prototype measurements across all evaluated configurations, reporting mean absolute error below 8%. We will revise the abstract to summarize the model basis, parameter sources, and validation error metric so readers can immediately assess independence from the input data. revision: yes
Referee: [Abstract] Abstract (execution model): the model assumes that separating weight-centric operators into dedicated domains while scaling KV capacity independently will keep weights cache-resident despite increased in-flight requests and KV footprint from deeper pipelining. No measurements of weight miss rates, actual LLC occupancy, or KV size versus pipeline depth are referenced, which is critical to showing that the 2.04x–11.51x speedups generalize rather than being artifacts of limited configurations.

Authors: The reported speedups are end-to-end measurements on real multi-socket hardware for the stated models and batch sizes; the decoupled placement and static runtime were explicitly designed to enforce weight residency. We acknowledge that direct instrumentation of per-operator LLC miss rates and occupancy versus pipeline depth would provide stronger confirmation. In the revised manuscript we will add these hardware-counter measurements (weight miss rate, LLC occupancy, and effective KV footprint) for the evaluated pipeline depths to demonstrate that residency holds and the speedups are not configuration-specific artifacts. revision: yes

Circularity Check

0 steps flagged

No circularity detected from available text

full rationale

The provided abstract and description reference a 'validated analytical model' for extrapolated speedups but contain no equations, derivations, parameter fits, or self-citations that could be inspected for reduction to inputs by construction. No load-bearing steps matching the enumerated patterns are quotable, so the derivation chain cannot be shown to collapse; the prototype results and model are treated as independent of the inputs in the given material.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no equations, parameters, or explicit assumptions, so the ledger cannot be populated beyond noting the absence of information.

pith-pipeline@v0.9.1-grok · 5817 in / 1148 out tokens · 28592 ms · 2026-06-25T20:14:43.913032+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

34 extracted references

[1]

Murray, Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng

Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irv- ing, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek G. Murray, Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. Tensorflow: a system for lar...

2016
[2]

Taming Throughput-Latency tradeoff in LLM inference with Sarathi-Serve

Amey Agrawal, Nitin Kedia, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav Gulavani, Alexey Tumanov, and Ramachandran Ramjee. Taming Throughput-Latency tradeoff in LLM inference with Sarathi-Serve. In 18th USENIX Symposium on Operating Systems De- sign and Implementation (OSDI 24) , pages 117–134, Santa Clara, CA, July 2024. USENIX Association

2024
[3]

IX: A protected dataplane op- erating system for high throughput and low latency

Adam Belay, George Prekas, Ana Klimovic, Samuel Grossman, Chris- tos Kozyrakis, and Edouard Bugnion. IX: A protected dataplane op- erating system for high throughput and low latency. In 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14), pages 49–65. USENIX Association, 2014

2014
[4]

TVM: An auto- mated End-to-End optimizing compiler for deep learning

Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Haichen Shen, Meghan Cowan, Leyuan Wang, Yuwei Hu, Luis Ceze, Carlos Guestrin, and Arvind Krishnamurthy. TVM: An auto- mated End-to-End optimizing compiler for deep learning. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), pages 578–594, Carlsbad, CA, O...

2018
[5]

Openmp: An industry- standard api for shared-memory programming

Leonardo Dagum and Ramesh Menon. Openmp: An industry- standard api for shared-memory programming. IEEE Computational Science and Engineering , 5(1):46–55, 1998

1998
[6]

Fu, Stefano Ermon, Atri Rudra, and Christopher Re

Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Re. Flashattention: Fast and memory-efficient exact attention with IO-awareness. In Advances in Neural Information Processing Systems , volume 35, 2022

2022
[7]

Marathe, and Nir Shavit

David Dice, Virendra J. Marathe, and Nir Shavit. Lock cohorting: a general technique for designing NUMA locks. In Proceedings of the 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 247–256, 2012

2012
[8]

llama.cpp

ggml-org contributors. llama.cpp. https://github.com/ggml- org/llama.cpp. Accessed: 2026-04-01

2026
[9]

Memory migration on next- touch

Brice Goglin and Nathalie Furmento. Memory migration on next- touch. In Proceedings of the Linux Symposium , pages 149–156, 2009

2009
[10]

Groq lpu

Groq. Groq lpu. https://groq.com. Accessed: 2026-04-01

2026
[11]

Applied ma- chine learning at facebook: A datacenter infrastructure perspective

Kim Hazelwood, Sarah Bird, David Brooks, Soumith Chintala, Utku Diril, Dmytro Dzhulgakov, Mohamed Fawzy, Bill Jia, Yangqing Jia, Aditya Kalro, James Law, Kevin Lee, Jason Lu, Pieter Noordhuis, Misha Smelyanskiy, Liang Xiong, and Xiaodong Wang. Applied ma- chine learning at facebook: A datacenter infrastructure perspective. In 2018 IEEE International Sympo...

2018
[12]

Waferllm: Large language model infer- ence at wafer scale

Congjie He, Yeqi Huang, Pei Mu, Ziming Miao, Jilong Xue, Lingxiao Ma, Fan Yang, and Luo Mai. Waferllm: Large language model infer- ence at wafer scale. In 19th USENIX Symposium on Operating Systems Design and Implementation (OSDI 25) . USENIX Association, 2025

2025
[13]

Le, Yonghui Wu, and Zhifeng Chen

Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Dehao Chen, Mia Xu Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V. Le, Yonghui Wu, and Zhifeng Chen. Gpipe: Efficient training of giant neural networks using pipeline parallelism. In Advances in Neural Information Processing Systems 32 , pages 103–112, 2019

2019
[14]

Gonzalez, Hao Zhang, and Ion Stoica

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention, 2023. SOSP 2023

2023
[15]

Infini- Gen: Efficient generative inference of large language models with dy- namic KV cache management

Wonbeom Lee, Jungi Lee, Junghwan Seo, and Jaewoong Sim. Infini- Gen: Efficient generative inference of large language models with dy- namic KV cache management. In 18th USENIX Symposium on Oper- ating Systems Design and Implementation (OSDI 24) , pages 155–172, Santa Clara, CA, July 2024. USENIX Association

2024
[16]

Mellor-Crummey and Michael L

John M. Mellor-Crummey and Michael L. Scott. Algorithms for scalable synchronization on shared-memory multiprocessors. ACM Transactions on Computer Systems , 9(1):21–65, February 1991

1991
[17]

Devanur, Gregory R

Deepak Narayanan, Aaron Harlap, Amar Phanishayee, Vivek Se- shadri, Nikhil R. Devanur, Gregory R. Ganger, Phillip B. Gibbons, and Matei Zaharia. Pipedream: generalized pipeline parallelism for DNN training. In Proceedings of the 27th ACM Symposium on Operating Systems Principles, pages 1–15, 2019

2019
[18]

Pytorch: an imperative style, high-performance deep learning library

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Yang, Zach DeVito, Martin Raison, Alykhan Tejani, Sasank Chil- amkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: an imperative style, high-perfor...

2019
[19]

Splitwise: Efficient gen- erative LLM inference using phase splitting, 2023

Pratyush Patel, Esha Choukse, Chaojie Zhang, Aashaka Shah, Íñigo Goiri, Saeed Maleki, and Ricardo Bianchini. Splitwise: Efficient gen- erative LLM inference using phase splitting, 2023. 13 Zhang et al

2023
[20]

Efficiently scaling transformer inference, 2022

Reiner Pope, Sholto Douglas, Aakanksha Chowdhery, Jacob Devlin, James Bradbury, Anselm Levskaya, Jonathan Heek, Kefan Xiao, Shiv- ani Agrawal, and Jeff Dean. Efficiently scaling transformer inference, 2022

2022
[21]

vattention: Dynamic memory management for serving llms without pagedattention

Ramya Prabhu, Ajay Nayak, Jayashree Mohan, Ramachandran Ram- jee, and Ashish Panwar. vattention: Dynamic memory management for serving llms without pagedattention. In Proceedings of the 30th ACM International Conference on Architectural Support for Program- ming Languages and Operating Systems, Volume 1 , pages 1133–1150, 2025

2025
[22]

Zygos: Achiev- ing low tail latency for microsecond-scale networked tasks

George Prekas, Marios Kogias, and Edouard Bugnion. Zygos: Achiev- ing low tail latency for microsecond-scale networked tasks. In Pro- ceedings of the 26th Symposium on Operating Systems Principles, pages 325–341, October 2017

2017
[23]

Arachne: Core-Aware thread management

Henry Qin, Qian Li, Jacqueline Speiser, Peter Kraft, and John Ouster- hout. Arachne: Core-Aware thread management. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), Carlsbad, CA, October 2018. USENIX Association

2018
[24]

Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines

Jonathan Ragan-Kelley, Connelly Barnes, Andrew Adams, Sylvain Paris, Frédo Durand, and Saman Amarasinghe. Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines. SIGPLAN Not. , 48(6):519–530, June 2013

2013
[25]

Fast transformer decoding: One write-head is all you need, 2019

Noam Shazeer. Fast transformer decoding: One write-head is all you need, 2019

2019
[26]

FlexGen: High-throughput generative inference of large language models with a single GPU

Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin, Beidi Chen, Percy Liang, Christopher Re, Ion Stoica, and Ce Zhang. FlexGen: High-throughput generative inference of large language models with a single GPU. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, editors, Proceedings of th...

2023
[27]

Megatron-lm: Training multi- billion parameter language models using model parallelism, 2019

Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-lm: Training multi- billion parameter language models using model parallelism, 2019

2019
[28]

Department of Commerce

U.S. Department of Commerce. Statement on uae and saudi chip exports. https://www.commerce.gov/news/press- releases/2025/11/statement-uae-and-saudi-chip-exports, 2025. Accessed: 2025-11-19

2025
[29]

Atten- tion is all you need

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Ł ukasz Kaiser, and Illia Polosukhin. Atten- tion is all you need. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wal- lach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017

2017
[30]

Supporting very large models using automatic dataflow graph partitioning

Minjie Wang, Chien-chin Huang, and Jinyang Li. Supporting very large models using automatic dataflow graph partitioning. In Pro- ceedings of the Fourteenth EuroSys Conference 2019 , EuroSys ’19, New York, NY, USA, 2019. Association for Computing Machinery

2019
[31]

T-mac: Cpu renaissance via table lookup for low-bit llm deployment on edge

Jianyu Wei, Shijie Cao, Ting Cao, Lingxiao Ma, Lei Wang, Yanyong Zhang, and Mao Yang. T-mac: Cpu renaissance via table lookup for low-bit llm deployment on edge. In Proceedings of the Twentieth Eu- ropean Conference on Computer Systems , EuroSys ’25, page 278–292, New York, NY, USA, 2025. Association for Computing Machinery

2025
[32]

Smoothquant: accurate and efficient post-training quantization for large language models

Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, and Song Han. Smoothquant: accurate and efficient post-training quantization for large language models. In Proceedings of the 40th International Conference on Machine Learning , ICML’23. JMLR.org, 2023

2023
[33]

Orca: A distributed serving system for Transformer-Based generative models

Gyeong-In Yu, Joo Seong Jeong, Geon-Woo Kim, Soojeong Kim, and Byung-Gon Chun. Orca: A distributed serving system for Transformer-Based generative models. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22) , pages 521– 538, Carlsbad, CA, July 2022. USENIX Association

2022
[34]

DistServe: Disaggregating pre- fill and decoding for goodput-optimized large language model serv- ing

Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xu- anzhe Liu, Xin Jin, and Hao Zhang. DistServe: Disaggregating pre- fill and decoding for goodput-optimized large language model serv- ing. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24), pages 193–210, Santa Clara, CA, July 2024. USENIX Association. 14

2024

[1] [1]

Murray, Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng

Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irv- ing, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek G. Murray, Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. Tensorflow: a system for lar...

2016

[2] [2]

Taming Throughput-Latency tradeoff in LLM inference with Sarathi-Serve

Amey Agrawal, Nitin Kedia, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav Gulavani, Alexey Tumanov, and Ramachandran Ramjee. Taming Throughput-Latency tradeoff in LLM inference with Sarathi-Serve. In 18th USENIX Symposium on Operating Systems De- sign and Implementation (OSDI 24) , pages 117–134, Santa Clara, CA, July 2024. USENIX Association

2024

[3] [3]

IX: A protected dataplane op- erating system for high throughput and low latency

Adam Belay, George Prekas, Ana Klimovic, Samuel Grossman, Chris- tos Kozyrakis, and Edouard Bugnion. IX: A protected dataplane op- erating system for high throughput and low latency. In 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14), pages 49–65. USENIX Association, 2014

2014

[4] [4]

TVM: An auto- mated End-to-End optimizing compiler for deep learning

Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Haichen Shen, Meghan Cowan, Leyuan Wang, Yuwei Hu, Luis Ceze, Carlos Guestrin, and Arvind Krishnamurthy. TVM: An auto- mated End-to-End optimizing compiler for deep learning. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), pages 578–594, Carlsbad, CA, O...

2018

[5] [5]

Openmp: An industry- standard api for shared-memory programming

Leonardo Dagum and Ramesh Menon. Openmp: An industry- standard api for shared-memory programming. IEEE Computational Science and Engineering , 5(1):46–55, 1998

1998

[6] [6]

Fu, Stefano Ermon, Atri Rudra, and Christopher Re

Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Re. Flashattention: Fast and memory-efficient exact attention with IO-awareness. In Advances in Neural Information Processing Systems , volume 35, 2022

2022

[7] [7]

Marathe, and Nir Shavit

David Dice, Virendra J. Marathe, and Nir Shavit. Lock cohorting: a general technique for designing NUMA locks. In Proceedings of the 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 247–256, 2012

2012

[8] [8]

llama.cpp

ggml-org contributors. llama.cpp. https://github.com/ggml- org/llama.cpp. Accessed: 2026-04-01

2026

[9] [9]

Memory migration on next- touch

Brice Goglin and Nathalie Furmento. Memory migration on next- touch. In Proceedings of the Linux Symposium , pages 149–156, 2009

2009

[10] [10]

Groq lpu

Groq. Groq lpu. https://groq.com. Accessed: 2026-04-01

2026

[11] [11]

Applied ma- chine learning at facebook: A datacenter infrastructure perspective

Kim Hazelwood, Sarah Bird, David Brooks, Soumith Chintala, Utku Diril, Dmytro Dzhulgakov, Mohamed Fawzy, Bill Jia, Yangqing Jia, Aditya Kalro, James Law, Kevin Lee, Jason Lu, Pieter Noordhuis, Misha Smelyanskiy, Liang Xiong, and Xiaodong Wang. Applied ma- chine learning at facebook: A datacenter infrastructure perspective. In 2018 IEEE International Sympo...

2018

[12] [12]

Waferllm: Large language model infer- ence at wafer scale

Congjie He, Yeqi Huang, Pei Mu, Ziming Miao, Jilong Xue, Lingxiao Ma, Fan Yang, and Luo Mai. Waferllm: Large language model infer- ence at wafer scale. In 19th USENIX Symposium on Operating Systems Design and Implementation (OSDI 25) . USENIX Association, 2025

2025

[13] [13]

Le, Yonghui Wu, and Zhifeng Chen

Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Dehao Chen, Mia Xu Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V. Le, Yonghui Wu, and Zhifeng Chen. Gpipe: Efficient training of giant neural networks using pipeline parallelism. In Advances in Neural Information Processing Systems 32 , pages 103–112, 2019

2019

[14] [14]

Gonzalez, Hao Zhang, and Ion Stoica

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention, 2023. SOSP 2023

2023

[15] [15]

Infini- Gen: Efficient generative inference of large language models with dy- namic KV cache management

Wonbeom Lee, Jungi Lee, Junghwan Seo, and Jaewoong Sim. Infini- Gen: Efficient generative inference of large language models with dy- namic KV cache management. In 18th USENIX Symposium on Oper- ating Systems Design and Implementation (OSDI 24) , pages 155–172, Santa Clara, CA, July 2024. USENIX Association

2024

[16] [16]

Mellor-Crummey and Michael L

John M. Mellor-Crummey and Michael L. Scott. Algorithms for scalable synchronization on shared-memory multiprocessors. ACM Transactions on Computer Systems , 9(1):21–65, February 1991

1991

[17] [17]

Devanur, Gregory R

Deepak Narayanan, Aaron Harlap, Amar Phanishayee, Vivek Se- shadri, Nikhil R. Devanur, Gregory R. Ganger, Phillip B. Gibbons, and Matei Zaharia. Pipedream: generalized pipeline parallelism for DNN training. In Proceedings of the 27th ACM Symposium on Operating Systems Principles, pages 1–15, 2019

2019

[18] [18]

Pytorch: an imperative style, high-performance deep learning library

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Yang, Zach DeVito, Martin Raison, Alykhan Tejani, Sasank Chil- amkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: an imperative style, high-perfor...

2019

[19] [19]

Splitwise: Efficient gen- erative LLM inference using phase splitting, 2023

Pratyush Patel, Esha Choukse, Chaojie Zhang, Aashaka Shah, Íñigo Goiri, Saeed Maleki, and Ricardo Bianchini. Splitwise: Efficient gen- erative LLM inference using phase splitting, 2023. 13 Zhang et al

2023

[20] [20]

Efficiently scaling transformer inference, 2022

Reiner Pope, Sholto Douglas, Aakanksha Chowdhery, Jacob Devlin, James Bradbury, Anselm Levskaya, Jonathan Heek, Kefan Xiao, Shiv- ani Agrawal, and Jeff Dean. Efficiently scaling transformer inference, 2022

2022

[21] [21]

vattention: Dynamic memory management for serving llms without pagedattention

Ramya Prabhu, Ajay Nayak, Jayashree Mohan, Ramachandran Ram- jee, and Ashish Panwar. vattention: Dynamic memory management for serving llms without pagedattention. In Proceedings of the 30th ACM International Conference on Architectural Support for Program- ming Languages and Operating Systems, Volume 1 , pages 1133–1150, 2025

2025

[22] [22]

Zygos: Achiev- ing low tail latency for microsecond-scale networked tasks

George Prekas, Marios Kogias, and Edouard Bugnion. Zygos: Achiev- ing low tail latency for microsecond-scale networked tasks. In Pro- ceedings of the 26th Symposium on Operating Systems Principles, pages 325–341, October 2017

2017

[23] [23]

Arachne: Core-Aware thread management

Henry Qin, Qian Li, Jacqueline Speiser, Peter Kraft, and John Ouster- hout. Arachne: Core-Aware thread management. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), Carlsbad, CA, October 2018. USENIX Association

2018

[24] [24]

Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines

Jonathan Ragan-Kelley, Connelly Barnes, Andrew Adams, Sylvain Paris, Frédo Durand, and Saman Amarasinghe. Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines. SIGPLAN Not. , 48(6):519–530, June 2013

2013

[25] [25]

Fast transformer decoding: One write-head is all you need, 2019

Noam Shazeer. Fast transformer decoding: One write-head is all you need, 2019

2019

[26] [26]

FlexGen: High-throughput generative inference of large language models with a single GPU

Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin, Beidi Chen, Percy Liang, Christopher Re, Ion Stoica, and Ce Zhang. FlexGen: High-throughput generative inference of large language models with a single GPU. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, editors, Proceedings of th...

2023

[27] [27]

Megatron-lm: Training multi- billion parameter language models using model parallelism, 2019

Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-lm: Training multi- billion parameter language models using model parallelism, 2019

2019

[28] [28]

Department of Commerce

U.S. Department of Commerce. Statement on uae and saudi chip exports. https://www.commerce.gov/news/press- releases/2025/11/statement-uae-and-saudi-chip-exports, 2025. Accessed: 2025-11-19

2025

[29] [29]

Atten- tion is all you need

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Ł ukasz Kaiser, and Illia Polosukhin. Atten- tion is all you need. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wal- lach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017

2017

[30] [30]

Supporting very large models using automatic dataflow graph partitioning

Minjie Wang, Chien-chin Huang, and Jinyang Li. Supporting very large models using automatic dataflow graph partitioning. In Pro- ceedings of the Fourteenth EuroSys Conference 2019 , EuroSys ’19, New York, NY, USA, 2019. Association for Computing Machinery

2019

[31] [31]

T-mac: Cpu renaissance via table lookup for low-bit llm deployment on edge

Jianyu Wei, Shijie Cao, Ting Cao, Lingxiao Ma, Lei Wang, Yanyong Zhang, and Mao Yang. T-mac: Cpu renaissance via table lookup for low-bit llm deployment on edge. In Proceedings of the Twentieth Eu- ropean Conference on Computer Systems , EuroSys ’25, page 278–292, New York, NY, USA, 2025. Association for Computing Machinery

2025

[32] [32]

Smoothquant: accurate and efficient post-training quantization for large language models

Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, and Song Han. Smoothquant: accurate and efficient post-training quantization for large language models. In Proceedings of the 40th International Conference on Machine Learning , ICML’23. JMLR.org, 2023

2023

[33] [33]

Orca: A distributed serving system for Transformer-Based generative models

Gyeong-In Yu, Joo Seong Jeong, Geon-Woo Kim, Soojeong Kim, and Byung-Gon Chun. Orca: A distributed serving system for Transformer-Based generative models. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22) , pages 521– 538, Carlsbad, CA, July 2022. USENIX Association

2022

[34] [34]

DistServe: Disaggregating pre- fill and decoding for goodput-optimized large language model serv- ing

Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xu- anzhe Liu, Xin Jin, and Hao Zhang. DistServe: Disaggregating pre- fill and decoding for goodput-optimized large language model serv- ing. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24), pages 193–210, Santa Clara, CA, July 2024. USENIX Association. 14

2024