Disaggregated prefill and decoding inference system for large language model serving on multi-vendor GPUs, 2025
2 Pith papers cite this work. Polarity classification is still indexing.
years: 2026 (2)
representative citing papers
- HBM Is Not All You Need: Efficient Disaggregated LLM Serving across Memory-heterogeneous Accelerators
HMA-Serve enables efficient cross-vendor disaggregated LLM serving on memory-heterogeneous accelerators via phase-wise quantization, compute-transfer pipelining, and deferred dequantization, delivering up to 3.2x goodput and 4.8x goodput-per-dollar.
- Demystifying the Design Space and Best Practices for Heterogeneous LLM Inference and Serving