Bandwidth-Aware LLM Inference on Heterogeneous Many-Core Supercomputers

Bin Han; Depei Qian; Gen Li; Hailong Yang; Jiaxing Qi; Shiqing Ma; Shizhe Shang; Yao Lu; Zhongzhi Luan

arxiv: 2605.25655 · v1 · pith:Z5ZW7SYMnew · submitted 2026-05-25 · 💻 cs.DC

Pith reviewed 2026-06-29 20:30 UTC · model grok-4.3

classification 💻 cs.DC

keywords LLM inference · heterogeneous many-core · bandwidth-aware optimization · THInfer · MT-3000 processor · hardware-software co-design · Prefill-Buffer-Decode pipeline · supercomputer deployment

The pith

THInfer achieves 62-84 percent higher LLM throughput on MT-3000 than DeepSpeed on GPUs by maximizing data locality under bandwidth limits.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces THInfer as a hardware-aware framework for running large language model inference on processors like the MT-3000 that have constrained main-memory bandwidth and a distributed memory hierarchy. It combines an optimized operator library, graph fusion with staged attention, and a Prefill-Buffer-Decode pipeline with bounded buffers to support hybrid parallelism across clusters. A sympathetic reader would care because the work shows these techniques let the system handle models up to 70B parameters where standard GPU frameworks cannot run, while delivering higher throughput than DeepSpeed baselines on V100S and A800 GPUs. The central claim is that hardware-software co-design can overcome bandwidth bottlenecks that prevent direct migration of existing inference code to many-core supercomputer nodes.
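The bandwidth bottleneck described above can be made concrete with the roofline model (reference [36]). The sketch below is illustrative only: `PEAK_FLOPS` and `MEM_BANDWIDTH` are hypothetical placeholder values, not MT-3000 specifications, and the function names are invented for this example. It shows why a single-token decode step (a matrix-vector product) tends to be bandwidth-bound while a long prefill (a matrix-matrix product) can be compute-bound.

```python
def gemm_intensity(m: int, n: int, k: int, bytes_per_elem: int = 2) -> float:
    """FLOPs per byte for C[m,n] = A[m,k] @ B[k,n], assuming each matrix
    crosses the memory interface exactly once (FP16 = 2 bytes per element)."""
    flops = 2 * m * n * k
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return flops / bytes_moved


def attainable_flops(peak_flops: float, bandwidth: float, intensity: float) -> float:
    """Roofline bound: the lower of the compute roof and the bandwidth roof."""
    return min(peak_flops, bandwidth * intensity)


# Hypothetical hardware numbers, chosen only to illustrate the two regimes.
PEAK_FLOPS = 4.0e12      # assumed FP16 peak, FLOP/s
MEM_BANDWIDTH = 50.0e9   # assumed main-memory bandwidth, bytes/s

decode_intensity = gemm_intensity(1, 4096, 4096)     # one token per step
prefill_intensity = gemm_intensity(512, 4096, 4096)  # 512 prompt tokens

decode_bound = attainable_flops(PEAK_FLOPS, MEM_BANDWIDTH, decode_intensity)
prefill_bound = attainable_flops(PEAK_FLOPS, MEM_BANDWIDTH, prefill_intensity)

print(f"decode:  {decode_intensity:.2f} FLOP/byte, bound {decode_bound:.3g} FLOP/s")
print(f"prefill: {prefill_intensity:.2f} FLOP/byte, bound {prefill_bound:.3g} FLOP/s")
```

Under these assumed numbers, decode sits far below the compute roof, which is the regime where keeping weights and KV data close to the compute units matters most.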

Core claim

THInfer is a hardware-aware inference framework that maximizes data locality under bandwidth-constrained conditions through hardware-software co-design and parallel strategy optimization. It incorporates a high-performance VLIW SIMD operator library; density-driven computation graph fusion with unified kernel scheduling and staged pipelined attention; and a Prefill-Buffer-Decode pipeline with bounded buffer management for hybrid parallelism via two-level MPI and hthreads communication. On Llama models it delivers 62-73 percent higher throughput than DeepSpeed on two V100S GPUs and 67-84 percent higher than on the A800 GPU for the 7B case, with comparable or better results at 13B and 30B, plus stable performance on the 70B model, where typical GPU-based frameworks fail to run under the same setting.

What carries the argument

The Prefill-Buffer-Decode (P-B-D) pipeline with bounded buffer management, combined with density-driven graph fusion and a hand-optimized FP16 VLIW SIMD operator library that reaches up to 70 percent of peak per cluster.
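This page does not describe how the paper implements its bounded buffer, so the following is only a generic producer/consumer sketch of the idea: a prefill stage hands (first token, KV state) pairs to a decode stage through a fixed-capacity queue, and a full queue stalls prefill instead of letting pending KV state grow without limit. All names (`prefill_worker`, `decode_worker`, `run_pipeline`, `buffer_slots`) are hypothetical, and the "tokens" are placeholder strings rather than model output.

```python
import queue
import threading

SENTINEL = None  # marks the end of the request stream


def prefill_worker(prompts, buffer):
    """Produce one (request id, first token, KV state) item per prompt."""
    for req_id, prompt in enumerate(prompts):
        kv_cache = list(prompt)                 # stand-in for the initial KV cache
        first_token = f"tok0<{prompt}>"         # stand-in for the first generated token
        buffer.put((req_id, first_token, kv_cache))  # blocks while the buffer is full
    buffer.put(SENTINEL)


def decode_worker(buffer, results, max_new_tokens):
    """Consume buffered requests and extend each one token by token."""
    while True:
        item = buffer.get()
        if item is SENTINEL:
            break
        req_id, first_token, kv_cache = item
        tokens = [first_token]
        for step in range(1, max_new_tokens):
            kv_cache.append(step)               # cache grows by one entry per token
            tokens.append(f"tok{step}")
        results[req_id] = tokens


def run_pipeline(prompts, buffer_slots=2, max_new_tokens=3):
    """Run prefill and decode concurrently with a bounded hand-off buffer."""
    buffer = queue.Queue(maxsize=buffer_slots)  # the bound provides backpressure
    results = {}
    producer = threading.Thread(target=prefill_worker, args=(prompts, buffer))
    consumer = threading.Thread(
        target=decode_worker, args=(buffer, results, max_new_tokens)
    )
    producer.start()
    consumer.start()
    producer.join()
    consumer.join()
    return results


if __name__ == "__main__":
    print(run_pipeline(["a", "bb", "ccc"], buffer_slots=1))
```

The design point is the `maxsize` argument: memory held by in-flight requests is capped by the number of buffer slots, at the cost of occasionally idling the prefill stage.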

If this is right

THInfer enables stable inference on 70B models on the MT-3000 where typical GPU frameworks cannot run under the same conditions.
The two-level communication strategy using MPI and hthreads supports efficient multi-cluster collaboration for larger models.
The staged pipelined attention fusion and unified kernel scheduling reduce latency and improve scalability on bandwidth-limited hardware.
The hand-optimized FP16 operator library reaches up to 70 percent of peak performance per cluster on the VLIW SIMD architecture.
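The staged attention fusion named in the list above is not specified on this page. As a stand-in illustration of why fusing attention stages helps data locality, the sketch below uses the well-known online-softmax accumulation from FlashAttention (reference [24]), not the paper's own kernel. It processes keys and values one block at a time and never materializes the full score vector. The function names and the pure-Python list representation are assumptions made for readability.

```python
import math


def attention_reference(q, keys, values):
    """Single-query softmax attention computed over all keys at once."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [
        sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)
    ]


def attention_blockwise(q, keys, values, block=2):
    """Same result, but keys/values are visited block by block while a running
    maximum, running normalizer, and running weighted sum are maintained."""
    dim = len(values[0])
    running_max = -math.inf
    running_sum = 0.0
    acc = [0.0] * dim
    for start in range(0, len(keys), block):
        k_blk = keys[start:start + block]
        v_blk = values[start:start + block]
        scores = [sum(a * b for a, b in zip(q, k)) for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale earlier partial results to the new maximum (0.0 on first block).
        scale = math.exp(running_max - new_max)
        running_sum *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(scores, v_blk):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]


q = [1.0, 0.5]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -1.0], [2.0, 0.3]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
full = attention_reference(q, keys, values)
tiled = attention_blockwise(q, keys, values, block=2)
print(all(abs(a - b) < 1e-12 for a, b in zip(full, tiled)))
```

Because each block's contribution is folded into a small running state, the working set stays bounded by the block size, which is the property that matters on hardware with limited main-memory bandwidth.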

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same co-design pattern could be tested on other many-core processors that share the MT-3000's bandwidth and memory-hierarchy constraints.
If the P-B-D pipeline generalizes, it might reduce the need for specialized GPU clusters when deploying LLMs on existing supercomputers.
Extending the density-driven fusion to additional operators could further improve performance on even larger models without increasing hardware requirements.

Load-bearing premise

The reported speedups assume that the DeepSpeed GPU baselines were run with the same model precision, batch sizes, and input lengths as the MT-3000 measurements.

What would settle it

A side-by-side run of the same Llama 7B workload with identical precision, batch size, and sequence length on both the MT-3000 under THInfer and two V100S GPUs under DeepSpeed that shows THInfer throughput at or below the GPU result.
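A minimal sketch of such a matched comparison is shown below. It refuses to compute a speedup unless both runs share one workload description, and it spells out the arithmetic behind a claim like "62 percent higher throughput" (a ratio of 1.62). The `RunConfig` fields, the function names, and every number in the usage lines are hypothetical placeholders, not values from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunConfig:
    """Workload parameters that must match for a fair comparison."""
    model: str
    precision: str
    batch_size: int
    input_len: int
    output_len: int


def throughput(cfg: RunConfig, wall_seconds: float) -> float:
    """Generated tokens per second for one timed run."""
    return cfg.batch_size * cfg.output_len / wall_seconds


def relative_gain(candidate_tps: float, baseline_tps: float) -> float:
    """Percent improvement of candidate over baseline (62.0 means 1.62x)."""
    return (candidate_tps / baseline_tps - 1.0) * 100.0


def compare(cfg_a: RunConfig, secs_a: float, cfg_b: RunConfig, secs_b: float) -> float:
    """Return the gain of run A over run B, rejecting mismatched workloads."""
    if cfg_a != cfg_b:
        raise ValueError("workloads differ; the gain cannot be attributed to the framework")
    return relative_gain(throughput(cfg_a, secs_a), throughput(cfg_b, secs_b))


if __name__ == "__main__":
    # Placeholder workload and timings, for illustration only.
    cfg = RunConfig("llama-7b", "fp16", batch_size=8, input_len=512, output_len=128)
    print(f"{compare(cfg, 10.0, cfg, 16.2):.1f}% higher throughput")
```

A result at or below zero from `compare` on identical configurations would be the outcome that contradicts the headline claim.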

Figures

Figures reproduced from arXiv: 2605.25655 by Bin Han, Depei Qian, Gen Li, Hailong Yang, Jiaxing Qi, Shiqing Ma, Shizhe Shang, Yao Lu, Zhongzhi Luan.

**Figure 1.** GPU vs. MT-3000 Memory Hierarchy.

Excerpt from the surrounding text: … batching, memory optimization, and quantization. However, their designs heavily rely on high-bandwidth unified memory architectures (such as the 900 GB/s memory bandwidth of NVIDIA V100), making them difficult to adapt to specific high-performance computing systems. Therefore, this paper targets the Tianhe New-Generation supercomputers, aiming to develop an efficient and fl…

**Figure 2.** Illustration of the MT-3000 system.

Excerpt from the surrounding text (Section B, Classical LLMs): Since the introduction of the Transformer architecture [11], its powerful contextual modeling and cross-task generalization capabilities have driven a paradigm shift in artificial intelligence. In natural language processing (NLP), BERT [12], based on bidirectional masked language modeling, broke through semantic representation bottlenecks with 340 mil…

**Figure 3.** THInfer: Co-Designing Operators, Graph-Level Algorithms, and System-Level Adaptive Parallelism for MT-3000 LLM Inference.

**Figure 4.** Computational Flowchart of LLM inference.

**Figure 5.** Data Flow Graph for GEMM Operator Computation: Three-level …

**Figure 6.** Before Tiling: Unbalanced transmission load of …

**Figure 7.** After Tiling: Y2 input and output are tiled to balance the transmission load.

**Table I.** Instruction pipeline scheduling for matrix multiplication kernel.

| VMAC | SMAC | SLDST | VLDST | SIEU |
|---|---|---|---|---|
| vfmulas32 X1[0, 1, 2][0], W1[0][0] | – | – | vldw W1[1][2, 3] | – |
| vfmulas32 X1[0, 1, 2][0], W1[0][1] | – | – | – | sbale X1[0][1] |
| vfmulas32 X1[0, 1, 2][0], W1[0][2] | svbcast X1[0][1] | sldh X1next[0][0] | – | sbale X1[1][1] |
| vfmulas32 X1[0, 1, 2][0], W1[0]… | | | | |

**Figure 8.** MT Attention: Staged Pipeline-Based Fusion Optimization Strategy.

**Figure 9.** Hardware-Aware Reduction Algorithm: (1) Use the …

**Figure 10.** P-B-D Three-Level Synchronous Pipeline.

Excerpt from the surrounding text: This pipeline divides the cluster into three logical pools based on functionality:

• Prefill Pool: Handles incoming requests, performs prefill computation, and generates the first token along with its corresponding initial KV cache. This pool employs Data Parallelism (DP) and Pipeline Parallelism (PP) strategies to fully exploit GEMM computational throughput.
• Buffe…

**Figure 11.** Performance utilization and theoretical peak analysis for the Linear …

Original abstract

Large language model (LLM) inference is limited by high computational cost and memory bandwidth demands, making deployment on heterogeneous many-core processors challenging. Taking the MT-3000 processor used in the Tianhe supercomputer as an example, its limited main-memory bandwidth and distributed memory hierarchy exemplify these bottlenecks, making it difficult to directly migrate existing GPU-based inference frameworks. To address this problem, we propose THInfer, a hardware-aware inference framework that maximizes data locality under bandwidth-constrained conditions through hardware-software co-design and parallel strategy optimization. THInfer incorporates three key techniques: (1) a high-performance operator library for the VLIW SIMD architecture, providing hand-optimized FP16 kernels that achieve up to 70 percent of the peak performance per cluster; (2) a density-driven computation graph fusion and unified kernel scheduling mechanism, combined with a staged pipelined attention fusion method; and (3) a Prefill-Buffer-Decode (P-B-D) pipeline and bounded buffer management strategy, which supports hybrid parallelism and enables efficient multi-cluster collaboration through two-level communication based on MPI and hthreads. Experiments on the Llama model series show that THInfer improves throughput on the 7B model by 62 percent to 73 percent over DeepSpeed on two V100S GPUs and by 67 percent to 84 percent over the A800 GPU. The 13B and 30B models also demonstrate comparable or better performance. Moreover, THInfer maintains stable performance on the 70B model, whereas typical GPU-based frameworks fail to run under the same setting. Overall, THInfer significantly enhances throughput, reduces latency, and improves scalability, providing a feasible system solution for efficient and scalable LLM inference on heterogeneous many-core architectures.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

THInfer adapts known inference optimizations to the MT-3000 but the GPU comparisons lack enough setup details to fully trust the speedups.

THInfer ports standard LLM inference techniques (hand-tuned FP16 kernels, graph fusion, and a prefill-buffer-decode pipeline) to the MT-3000 VLIW processor and its memory hierarchy. It claims 62-84% higher throughput than DeepSpeed on V100S or A800 GPUs for 7B Llama, and stable runs on 70B where the GPU baselines fail.

What is actually new is the hardware-specific tuning: the operator library targeting the VLIW SIMD clusters, the density-driven fusion plus staged attention, and the two-level MPI/hthreads communication for multi-cluster scaling. These are straightforward adaptations rather than new algorithms, but they address the bandwidth and distributed-memory constraints that make direct GPU-framework ports fail on this machine.

The paper does a reasonable job showing that many-core supercomputer nodes can be made usable for inference without new silicon. Kernels that reach up to 70% of peak performance per cluster and the hybrid parallelism strategy are concrete engineering steps that could matter for national computing centers.

The soft spot is the experimental comparison. The abstract states the throughput numbers but gives no batch size, input/output lengths, or numeric precision for either platform. If those parameters differ even modestly, the reported gains cannot be cleanly attributed to THInfer. This is where the load-bearing premise is stressed: the full paper needs to document the controls explicitly, or the advantage remains unverified.

This is for people who run LLMs on existing heterogeneous supercomputers rather than pure GPU clusters. A reader working on hardware-specific systems would find the implementation choices useful.

I would send it to peer review. The problem is real and the approach is practical enough to warrant checking the details.

Referee Report

1 major / 0 minor

Summary. The manuscript introduces THInfer, a hardware-aware LLM inference framework for the MT-3000 processor in the Tianhe supercomputer. It describes three techniques: (1) a hand-optimized FP16 operator library for the VLIW SIMD architecture achieving up to 70% of peak per cluster, (2) density-driven graph fusion with unified scheduling and staged pipelined attention, and (3) a Prefill-Buffer-Decode (P-B-D) pipeline with bounded buffers supporting hybrid parallelism via MPI/hthreads two-level communication. The central empirical claim is that THInfer delivers 62-84% higher throughput than DeepSpeed on V100S/A800 GPUs for Llama-7B (with comparable or better results for 13B/30B and stable 70B performance where GPU frameworks fail).

Significance. If the throughput claims hold under matched workloads, the work is significant for demonstrating practical LLM inference on bandwidth-constrained many-core heterogeneous systems outside the GPU ecosystem. The explicit co-design of kernels, fusion, and pipelining for the MT-3000 memory hierarchy provides a concrete template that could be adapted to other supercomputer architectures.

major comments (1)

[Abstract and Experimental Evaluation] The headline throughput claims (62–73% over two V100S GPUs and 67–84% over A800 for the 7B model) are presented without any tabulated values for batch size, input/output sequence lengths, numeric precision, or measurement methodology on either platform. This directly undermines attribution of the gains to THInfer’s techniques rather than possible mismatches in workload parameters, making the central empirical claim unverifiable from the provided information.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the single major comment below and will incorporate the requested details in the revision.

Referee: [Abstract and Experimental Evaluation] The headline throughput claims (62–73% over two V100S GPUs and 67–84% over A800 for the 7B model) are presented without any tabulated values for batch size, input/output sequence lengths, numeric precision, or measurement methodology on either platform. This directly undermines attribution of the gains to THInfer’s techniques rather than possible mismatches in workload parameters, making the central empirical claim unverifiable from the provided information.

Authors: We agree that the current presentation of the headline claims lacks sufficient experimental parameters to allow direct verification. In the revised manuscript we will add an explicit table (or expanded subsection) in the Experimental Evaluation section that reports, for each model size and platform: batch size, input/output sequence lengths, numeric precision, and the precise measurement methodology (including timing method and hardware configuration) used for both THInfer and the DeepSpeed baselines. This addition will make the throughput comparisons fully reproducible and will strengthen attribution of the observed gains to the described co-design techniques.

Revision: yes

Circularity Check

0 steps flagged

No circularity; all claims are empirical measurements without derivations or self-referential reductions

The paper presents THInfer as a hardware-aware framework with three described techniques (operator library, graph fusion, P-B-D pipeline) and reports empirical throughput gains on Llama models versus DeepSpeed baselines. The full text contains no equations, first-principles derivations, fitted parameters renamed as predictions, or load-bearing self-citations. Central claims rest on direct measurements under stated conditions rather than any reduction to inputs by construction. This is the expected outcome for a systems paper focused on implementation and benchmarking.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an applied engineering systems paper with no mathematical derivations, fitted constants, or postulated entities; the contributions consist of implementation choices and empirical measurements.

pith-pipeline@v0.9.1-grok · 5870 in / 1327 out tokens · 53473 ms · 2026-06-29T20:30:59.688607+00:00 · methodology

Reference graph

Works this paper leans on

38 extracted references · 15 canonical work pages · 14 internal anchors

[1] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar et al., “Llama: Open and efficient foundation language models,” arXiv preprint arXiv:2302.13971, 2023.

[2] J. Bai, S. Bai, Y. Chu, Z. Cui, K. Dang, X. Deng, Y. Fan, W. Ge, Y. Han, F. Huang et al., “Qwen technical report,” arXiv preprint arXiv:2309.16609, 2023.

[3] D. Guo, D. Yang, H. Zhang, J. Song, R. Zhang, R. Xu, Q. Zhu, S. Ma, P. Wang, X. Bi et al., “Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning,” arXiv preprint arXiv:2501.12948, 2025.

[4] W. Kwon, Z. Li, S. Zhuang, Y. Sheng, L. Zheng, C. H. Yu, J. Gonzalez, H. Zhang, and I. Stoica, “Efficient memory management for large language model serving with pagedattention,” in Proceedings of the 29th symposium on operating systems principles, 2023, pp. 611–626.

[5] NVIDIA, “TensorRT-LLM,” https://github.com/NVIDIA/TensorRT-LLM

[6] R. Y. Aminabadi, S. Rajbhandari, A. A. Awan, C. Li, D. Li, E. Zheng, O. Ruwase, S. Smith, M. Zhang, J. Rasley et al., “Deepspeed-inference: enabling efficient inference of transformer models at unprecedented scale,” in SC22: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 2022, pp. 1–15.

[7] J. Chen, C. Liu, Z. Luana, M. Gong, Q. Li, and D. Qian, “Large-scale parallelization and optimization of lattice qcd on tianhe new generation supercomputer,” in 2023 IEEE International Conference on High Performance Computing & Communications, Data Science & Systems, Smart City & Dependability in Sensor, Cloud & Big Data Systems & Application (HPCC/DSS/Sm...

[8] K. Lu, Y. Wang, Y. Guo, C. Huang, S. Liu, R. Wang, J. Fang, T. Tang, Z. Chen, B. Liu et al., “Mt-3000: a heterogeneous multi-zone processor for hpc,” CCF Transactions on High Performance Computing, vol. 4, no. 2, pp. 150–164, 2022.

[9] M. Khalilov and A. Timoveev, “Performance analysis of cuda, openacc and openmp programming models on tesla v100 gpu,” in Journal of Physics: Conference Series, vol. 1740, no. 1. IOP Publishing, 2021, p. 012056.

[10] D. W. Walker and J. J. Dongarra, “Mpi: a standard message passing interface,” Supercomputer, vol. 12, pp. 56–68, 1996.
[11] A. Vaswani, “Attention is all you need,” Advances in Neural Information Processing Systems, 2017.

[12] M. V. Koroteev, “Bert: a review of applications in natural language processing and understanding,” arXiv preprint arXiv:2103.11943, 2021.

[13] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever et al., “Language models are unsupervised multitask learners,” OpenAI blog, vol. 1, no. 8, p. 9, 2019.

[14] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell et al., “Language models are few-shot learners,” Advances in neural information processing systems, vol. 33, pp. 1877–1901, 2020.

[15] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale et al., “Llama 2: Open foundation and fine-tuned chat models,” arXiv preprint arXiv:2307.09288, 2023.

[16] A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Yang, A. Fan et al., “The llama 3 herd of models,” arXiv preprint arXiv:2407.21783, 2024.

[17] A. Liu, B. Feng, B. Xue, B. Wang, B. Wu, C. Lu, C. Zhao, C. Deng, C. Zhang, C. Ruan et al., “Deepseek-v3 technical report,” arXiv preprint arXiv:2412.19437, 2024.

[18] Q. Team, “Qwen2 technical report,” arXiv preprint arXiv:2407.10671, 2024.

[19] E. Frantar, S. Ashkboos, T. Hoefler, and D. Alistarh, “Gptq: Accurate post-training quantization for generative pre-trained transformers,” arXiv preprint arXiv:2210.17323, 2022.

[20] J. Lin, J. Tang, H. Tang, S. Yang, W.-M. Chen, W.-C. Wang, G. Xiao, X. Dang, C. Gan, and S. Han, “Awq: Activation-aware weight quantization for on-device llm compression and acceleration,” Proceedings of Machine Learning and Systems, vol. 6, pp. 87–100, 2024.
[21] E. Frantar and D. Alistarh, “Sparsegpt: Massive language models can be accurately pruned in one-shot,” in International Conference on Machine Learning. PMLR, 2023, pp. 10 323–10 337.

[22] X. Ma, G. Fang, and X. Wang, “Llm-pruner: On the structural pruning of large language models,” Advances in neural information processing systems, vol. 36, pp. 21 702–21 720, 2023.

[23] Y. Gu, L. Dong, F. Wei, and M. Huang, “Minillm: Knowledge distillation of large language models,” arXiv preprint arXiv:2306.08543, 2023.

[24] T. Dao, D. Fu, S. Ermon, A. Rudra, and C. Ré, “Flashattention: Fast and memory-efficient exact attention with io-awareness,” Advances in Neural Information Processing Systems, vol. 35, pp. 16 344–16 359, 2022.

[25] I. Beltagy, M. E. Peters, and A. Cohan, “Longformer: The long-document transformer,” arXiv preprint arXiv:2004.05150, 2020.

[26] S. Wang, B. Z. Li, M. Khabsa, H. Fang, and H. Ma, “Linformer: Self-attention with linear complexity,” arXiv preprint arXiv:2006.04768, 2020.

[27] N. Kitaev, Ł. Kaiser, and A. Levskaya, “Reformer: The efficient transformer,” arXiv preprint arXiv:2001.04451, 2020.

[28] J. Ainslie, J. Lee-Thorp, M. De Jong, Y. Zemlyanskiy, F. Lebrón, and S. Sanghai, “Gqa: Training generalized multi-query transformer models from multi-head checkpoints,” arXiv preprint arXiv:2305.13245, 2023.

[29] Z. Liu, A. Desai, F. Liao, W. Wang, V. Xie, Z. Xu, A. Kyrillidis, and A. Shrivastava, “Scissorhands: Exploiting the persistence of importance hypothesis for llm kv cache compression at test time,” Advances in Neural Information Processing Systems, vol. 36, pp. 52 342–52 364, 2023.

[30] G. Xiao, Y. Tian, B. Chen, S. Han, and M. Lewis, “Efficient streaming language models with attention sinks,” arXiv preprint arXiv:2309.17453, 2023.
[31] G.-I. Yu, J. S. Jeong, G.-W. Kim, S. Kim, and B.-G. Chun, “Orca: A distributed serving system for Transformer-Based generative models,” in 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22), 2022, pp. 521–538.

[32] Hugging Face, “Text Generation Inference,” https://github.com/huggingface/text-generation-inference

[33] Y. Sheng, L. Zheng, B. Yuan, Z. Li, M. Ryabinin, B. Chen, P. Liang, C. Ré, I. Stoica, and C. Zhang, “Flexgen: High-throughput generative inference of large language models with a single gpu,” in International Conference on Machine Learning. PMLR, 2023, pp. 31 094–31 116.

[34] Y. Zhong, S. Liu, J. Chen, J. Hu, Y. Zhu, X. Liu, X. Jin, and H. Zhang, “DistServe: Disaggregating prefill and decoding for goodput-optimized large language model serving,” in 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24), 2024, pp. 193–210.

[35] V. Sze, Y.-H. Chen, T.-J. Yang, and J. S. Emer, “Efficient processing of deep neural networks: A tutorial and survey,” Proceedings of the IEEE, vol. 105, no. 12, pp. 2295–2329, 2017.

[36] S. Williams, A. Waterman, and D. Patterson, “Roofline: an insightful visual performance model for multicore architectures,” Communications of the ACM, vol. 52, no. 4, pp. 65–76, 2009.

[37] K. Yu, X. Qi, P. Zhang, J. Fang, D. Dong, R. Wang, T. Tang, C. Huang, Y. Che, and Z. Wang, “Optimizing general matrix multiplications on modern multi-core dsps,” in 2024 IEEE International Parallel and Distributed Processing Symposium (IPDPS). IEEE, 2024, pp. 964–975.

Yao Lu born in 1998, PhD candidate with Beihang University, Beijing China. His main rese...

[38] He is a professor with the School of Computer Science and Engineering, Beihang University, China. He served as the chief scientist of China National High Technology Program on high performance computing for 20 years. His research interests include innovative technologies in distributed computing, high performance computing, and computer architectu...

[1] [1]

LLaMA: Open and Efficient Foundation Language Models

H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozi `ere, N. Goyal, E. Hambro, F. Azharet al., “Llama: Open and efficient foundation language models,”arXiv preprint arXiv:2302.13971, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[2] [2]

Qwen Technical Report

J. Bai, S. Bai, Y . Chu, Z. Cui, K. Dang, X. Deng, Y . Fan, W. Ge, Y . Han, F. Huanget al., “Qwen technical report,”arXiv preprint arXiv:2309.16609, 2023. PREPRINT 12

work page internal anchor Pith review Pith/arXiv arXiv 2023

[3] [3]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

D. Guo, D. Yang, H. Zhang, J. Song, R. Zhang, R. Xu, Q. Zhu, S. Ma, P. Wang, X. Biet al., “Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning,”arXiv preprint arXiv:2501.12948, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[4] [4]

Efficient memory management for large language model serving with pagedattention,

W. Kwon, Z. Li, S. Zhuang, Y . Sheng, L. Zheng, C. H. Yu, J. Gonzalez, H. Zhang, and I. Stoica, “Efficient memory management for large language model serving with pagedattention,” inProceedings of the 29th symposium on operating systems principles, 2023, pp. 611–626

2023

[5] [5]

TensorRT-LLM,

NVIDIA, “TensorRT-LLM,” https://github.com/NVIDIA/ TensorRT-LLM

[6] [6]

Deepspeed-inference: enabling efficient inference of transformer models at unprecedented scale,

R. Y . Aminabadi, S. Rajbhandari, A. A. Awan, C. Li, D. Li, E. Zheng, O. Ruwase, S. Smith, M. Zhang, J. Rasleyet al., “Deepspeed-inference: enabling efficient inference of transformer models at unprecedented scale,” inSC22: International Conference for High Performance Com- puting, Networking, Storage and Analysis. IEEE, 2022, pp. 1–15

2022

[7] [7]

Large- scale parallelization and optimization of lattice qcd on tianhe new generation supercomputer,

J. Chen, C. Liu, Z. Luana, M. Gong, Q. Li, and D. Qian, “Large- scale parallelization and optimization of lattice qcd on tianhe new generation supercomputer,” in2023 IEEE International Conference on High Performance Computing & Communications, Data Science & Systems, Smart City & Dependability in Sensor, Cloud & Big Data Systems & Application (HPCC/DSS/Sm...

2023

[8] [8]

Mt-3000: a heterogeneous multi-zone processor for hpc,

K. Lu, Y . Wang, Y . Guo, C. Huang, S. Liu, R. Wang, J. Fang, T. Tang, Z. Chen, B. Liuet al., “Mt-3000: a heterogeneous multi-zone processor for hpc,”CCF Transactions on High Performance Computing, vol. 4, no. 2, pp. 150–164, 2022

2022

[9] [9]

Performance analysis of cuda, openacc and openmp programming models on tesla v100 gpu,

M. Khalilov and A. Timoveev, “Performance analysis of cuda, openacc and openmp programming models on tesla v100 gpu,” inJournal of Physics: Conference Series, vol. 1740, no. 1. IOP Publishing, 2021, p. 012056

2021

[10] [10]

Mpi: a standard message passing interface,

D. W. Walker and J. J. Dongarra, “Mpi: a standard message passing interface,”Supercomputer, vol. 12, pp. 56–68, 1996

1996

[11] [11]

Attention is all you need,

A. Vaswani, “Attention is all you need,”Advances in Neural Information Processing Systems, 2017

2017

[12] [12]

Bert: a review of applications in natural language processing and understanding,

M. V . Koroteev, “Bert: a review of applications in natural language processing and understanding,”arXiv preprint arXiv:2103.11943, 2021

work page arXiv 2021

[13] [13]

Language models are unsupervised multitask learners,

A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskeveret al., “Language models are unsupervised multitask learners,”OpenAI blog, vol. 1, no. 8, p. 9, 2019

2019

[14] [14]

Language mod- els are few-shot learners,

T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askellet al., “Language mod- els are few-shot learners,”Advances in neural information processing systems, vol. 33, pp. 1877–1901, 2020

1901

[15] [15]

Llama 2: Open Foundation and Fine-Tuned Chat Models

H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y . Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosaleet al., “Llama 2: Open foundation and fine-tuned chat models,”arXiv preprint arXiv:2307.09288, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[16] A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Yang, A. Fan et al., “The Llama 3 herd of models,” arXiv preprint arXiv:2407.21783, 2024.

[17] A. Liu, B. Feng, B. Xue, B. Wang, B. Wu, C. Lu, C. Zhao, C. Deng, C. Zhang, C. Ruan et al., “DeepSeek-V3 technical report,” arXiv preprint arXiv:2412.19437, 2024.

[18] Q. Team, “Qwen2 technical report,” arXiv preprint arXiv:2407.10671, 2024.

[19] E. Frantar, S. Ashkboos, T. Hoefler, and D. Alistarh, “GPTQ: Accurate post-training quantization for generative pre-trained transformers,” arXiv preprint arXiv:2210.17323, 2022.

[20] J. Lin, J. Tang, H. Tang, S. Yang, W.-M. Chen, W.-C. Wang, G. Xiao, X. Dang, C. Gan, and S. Han, “AWQ: Activation-aware weight quantization for on-device LLM compression and acceleration,” Proceedings of Machine Learning and Systems, vol. 6, pp. 87–100, 2024.

[21] E. Frantar and D. Alistarh, “SparseGPT: Massive language models can be accurately pruned in one-shot,” in International Conference on Machine Learning. PMLR, 2023, pp. 10 323–10 337.

[22] X. Ma, G. Fang, and X. Wang, “LLM-Pruner: On the structural pruning of large language models,” Advances in Neural Information Processing Systems, vol. 36, pp. 21 702–21 720, 2023.

[23] Y. Gu, L. Dong, F. Wei, and M. Huang, “MiniLLM: Knowledge distillation of large language models,” arXiv preprint arXiv:2306.08543, 2023.

[24] T. Dao, D. Fu, S. Ermon, A. Rudra, and C. Ré, “FlashAttention: Fast and memory-efficient exact attention with IO-awareness,” Advances in Neural Information Processing Systems, vol. 35, pp. 16 344–16 359, 2022.

[25] I. Beltagy, M. E. Peters, and A. Cohan, “Longformer: The long-document transformer,” arXiv preprint arXiv:2004.05150, 2020.

[26] S. Wang, B. Z. Li, M. Khabsa, H. Fang, and H. Ma, “Linformer: Self-attention with linear complexity,” arXiv preprint arXiv:2006.04768, 2020.

[27] N. Kitaev, Ł. Kaiser, and A. Levskaya, “Reformer: The efficient transformer,” arXiv preprint arXiv:2001.04451, 2020.

[28] J. Ainslie, J. Lee-Thorp, M. De Jong, Y. Zemlyanskiy, F. Lebrón, and S. Sanghai, “GQA: Training generalized multi-query transformer models from multi-head checkpoints,” arXiv preprint arXiv:2305.13245, 2023.

[29] Z. Liu, A. Desai, F. Liao, W. Wang, V. Xie, Z. Xu, A. Kyrillidis, and A. Shrivastava, “Scissorhands: Exploiting the persistence of importance hypothesis for LLM KV cache compression at test time,” Advances in Neural Information Processing Systems, vol. 36, pp. 52 342–52 364, 2023.

[30] G. Xiao, Y. Tian, B. Chen, S. Han, and M. Lewis, “Efficient streaming language models with attention sinks,” arXiv preprint arXiv:2309.17453, 2023.

[31] G.-I. Yu, J. S. Jeong, G.-W. Kim, S. Kim, and B.-G. Chun, “Orca: A distributed serving system for Transformer-based generative models,” in 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22), 2022, pp. 521–538.

[32] Hugging Face, “Text Generation Inference,” https://github.com/huggingface/text-generation-inference

[33] Y. Sheng, L. Zheng, B. Yuan, Z. Li, M. Ryabinin, B. Chen, P. Liang, C. Ré, I. Stoica, and C. Zhang, “FlexGen: High-throughput generative inference of large language models with a single GPU,” in International Conference on Machine Learning. PMLR, 2023, pp. 31 094–31 116.

[34] Y. Zhong, S. Liu, J. Chen, J. Hu, Y. Zhu, X. Liu, X. Jin, and H. Zhang, “DistServe: Disaggregating prefill and decoding for goodput-optimized large language model serving,” in 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24), 2024, pp. 193–210.

[35] V. Sze, Y.-H. Chen, T.-J. Yang, and J. S. Emer, “Efficient processing of deep neural networks: A tutorial and survey,” Proceedings of the IEEE, vol. 105, no. 12, pp. 2295–2329, 2017.

[36] S. Williams, A. Waterman, and D. Patterson, “Roofline: an insightful visual performance model for multicore architectures,” Communications of the ACM, vol. 52, no. 4, pp. 65–76, 2009.

[37] K. Yu, X. Qi, P. Zhang, J. Fang, D. Dong, R. Wang, T. Tang, C. Huang, Y. Che, and Z. Wang, “Optimizing general matrix multiplications on modern multi-core DSPs,” in 2024 IEEE International Parallel and Distributed Processing Symposium (IPDPS). IEEE, 2024, pp. 964–975.

Yao Lu, born in 1998, PhD candidate with Beihang University, Beijing China. His main rese...

He is a professor with the School of Computer Science and Engineering, Beihang University, China. He served as the chief scientist of China National High Technology Program on high performance computing for 20 years. His research interests include innovative technologies in distributed computing, high performance computing, and computer architectu...