arXiv preprint arXiv:2504.19516

Zejia Lin, Hongxin Xu, Guanyi Chen, Xianwei Zhang, Yutong Lu · 2025 · arXiv 2504.19516

3 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

Tutti: Making SSD-Backed KV Cache Practical for Long-Context LLM Serving

cs.OS · 2026-05-05 · unverdicted · novelty 7.0

Tutti is a GPU-direct SSD-backed KV cache that removes CPU bottlenecks via object abstraction, GPU io_uring, and slack scheduling, delivering near-DRAM performance at 2x higher request rate and 27% lower cost than prior GDS-based systems.

Characterizing Performance-Energy Trade-offs of Large Language Models in Multi-Request Workflows

cs.DC · 2026-03-12 · unverdicted · novelty 7.0

This work delivers the first measurements of performance-energy trade-offs across four multi-request LLM workflow patterns on A100 GPUs using vLLM and Parrot.

PipeMax: Enhancing Offline LLM Inference on Commodity GPU Servers

cs.DC · 2026-05-04 · unverdicted · novelty 5.0

PipeMax integrates pipeline parallelism with offloading to achieve up to 2.51x higher throughput than vLLM for offline LLM inference on commodity 8-GPU servers.

citing papers explorer

Showing 3 of 3 citing papers.

Tutti: Making SSD-Backed KV Cache Practical for Long-Context LLM Serving · cs.OS · 2026-05-05 · unverdicted · none · ref 26
Characterizing Performance-Energy Trade-offs of Large Language Models in Multi-Request Workflows · cs.DC · 2026-03-12 · unverdicted · none · ref 42
PipeMax: Enhancing Offline LLM Inference on Commodity GPU Servers · cs.DC · 2026-05-04 · unverdicted · none · ref 23
