WarmServe: Enabling one-for-many GPU prewarming for multi-LLM serving.arXiv preprint arXiv:2512.09472, 2025

Chiheng Lou, Sheng Qi, Rui Kang, et al · 2025 · arXiv 2512.09472

2 Pith papers cite this work. Polarity classification is still indexing.

2 Pith papers citing it

representative citing papers

Foundry: Template-Based CUDA Graph Context Materialization for Fast LLM Serving Cold Start

cs.DC · 2026-04-08 · unverdicted · novelty 6.0

Foundry uses template-based CUDA graph context materialization to reduce LLM serving cold-start latency by up to 99% while preserving CUDA graph throughput gains.

The Workload-Router-Pool Architecture for LLM Inference Optimization: A Vision Paper from the vLLM Semantic Router Project

cs.LG · 2026-03-22 · unverdicted · novelty 5.0

The Workload-Router-Pool architecture is a 3D framework for LLM inference optimization that synthesizes prior vLLM work into a 3x3 interaction matrix and proposes 21 research directions at the intersections.

citing papers explorer

Showing 2 of 2 citing papers.

Foundry: Template-Based CUDA Graph Context Materialization for Fast LLM Serving Cold Start cs.DC · 2026-04-08 · unverdicted · none · ref 31
Foundry uses template-based CUDA graph context materialization to reduce LLM serving cold-start latency by up to 99% while preserving CUDA graph throughput gains.
The Workload-Router-Pool Architecture for LLM Inference Optimization: A Vision Paper from the vLLM Semantic Router Project cs.LG · 2026-03-22 · unverdicted · none · ref 94
The Workload-Router-Pool architecture is a 3D framework for LLM inference optimization that synthesizes prior vLLM work into a 3x3 interaction matrix and proposes 21 research directions at the intersections.

WarmServe: Enabling one-for-many GPU prewarming for multi-LLM serving.arXiv preprint arXiv:2512.09472, 2025

fields

years

verdicts

representative citing papers

citing papers explorer