Disaggregated prefill and decoding inference system for large language model serving on multi-vendor GPUs, 2025

Xing Chen, Rong Shi, Lu Zhao, Lingbin Wang, Xiao Jin, Yueqiang Chen, Hongfeng Sun · 2025 · arXiv 2509.17542

2 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

HBM Is Not All You Need: Efficient Disaggregated LLM Serving across Memory-heterogeneous Accelerators

cs.AR · 2026-06-29 · unverdicted · novelty 7.0

HMA-Serve enables efficient cross-vendor disaggregated LLM serving on memory-heterogeneous accelerators via phase-wise quantization, compute-transfer pipelining, and deferred dequantization, delivering up to 3.2x goodput and 4.8x goodput-per-dollar.

Demystifying the Design Space and Best Practices for Heterogeneous LLM Inference and Serving

cs.DC · 2026-06-29

citing papers explorer

Showing 2 of 2 citing papers.

HBM Is Not All You Need: Efficient Disaggregated LLM Serving across Memory-heterogeneous Accelerators

cs.AR · 2026-06-29 · unverdicted · none · ref 3

HMA-Serve enables efficient cross-vendor disaggregated LLM serving on memory-heterogeneous accelerators via phase-wise quantization, compute-transfer pipelining, and deferred dequantization, delivering up to 3.2x goodput and 4.8x goodput-per-dollar.

Demystifying the Design Space and Best Practices for Heterogeneous LLM Inference and Serving

cs.DC · 2026-06-29 · unreviewed · ref 48
