Lambdas- cale: Enabling fast scaling for serverless large language model inference,

· 2025 · arXiv 2502.09922

4 Pith papers cite this work. Polarity classification is still indexing.

4 Pith papers citing it

read on arXiv browse 4 citing papers

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

ROSE: Rollout On Serving GPUs via Cooperative Elasticity for Agentic RL

cs.DC · 2026-05-07 · unverdicted · novelty 6.0 · 2 refs

ROSE is a system for cooperative elasticity that co-locates serving and rollout models on shared GPUs, delivering 1.3-3.3x higher end-to-end throughput than fixed-resource baselines while preserving serving SLOs.

Foundry: Template-Based CUDA Graph Context Materialization for Fast LLM Serving Cold Start

cs.DC · 2026-04-08 · unverdicted · novelty 6.0

Foundry uses template-based CUDA graph context materialization to reduce LLM serving cold-start latency by up to 99% while preserving CUDA graph throughput gains.

Janus: Disaggregating Attention and Experts for Scalable MoE Inference

cs.DC · 2025-12-15 · unverdicted · novelty 6.0

JANUS disaggregates attention and MoE layers onto separate GPU pools with an expert-balancing scheduler and SLO-aware scaling, delivering up to 4.7x higher per-GPU throughput than prior MoE systems under token-level latency constraints.

WarmServe: Enabling One-for-Many GPU Prewarming for Multi-LLM Serving

cs.DC · 2025-12-10 · unverdicted · novelty 6.0

WarmServe reduces tail TTFT by up to 50.8× versus autoscaling and supports 2.5× higher throughput than GPU-sharing by using one-for-many prewarming, model placement, KV cache reservation, and efficient tensor switching.

citing papers explorer

Showing 4 of 4 citing papers after filters.

ROSE: Rollout On Serving GPUs via Cooperative Elasticity for Agentic RL cs.DC · 2026-05-07 · unverdicted · none · ref 87 · 2 links
ROSE is a system for cooperative elasticity that co-locates serving and rollout models on shared GPUs, delivering 1.3-3.3x higher end-to-end throughput than fixed-resource baselines while preserving serving SLOs.
Foundry: Template-Based CUDA Graph Context Materialization for Fast LLM Serving Cold Start cs.DC · 2026-04-08 · unverdicted · none · ref 52
Foundry uses template-based CUDA graph context materialization to reduce LLM serving cold-start latency by up to 99% while preserving CUDA graph throughput gains.
Janus: Disaggregating Attention and Experts for Scalable MoE Inference cs.DC · 2025-12-15 · unverdicted · none · ref 34
JANUS disaggregates attention and MoE layers onto separate GPU pools with an expert-balancing scheduler and SLO-aware scaling, delivering up to 4.7x higher per-GPU throughput than prior MoE systems under token-level latency constraints.
WarmServe: Enabling One-for-Many GPU Prewarming for Multi-LLM Serving cs.DC · 2025-12-10 · unverdicted · none · ref 17
WarmServe reduces tail TTFT by up to 50.8× versus autoscaling and supports 2.5× higher throughput than GPU-sharing by using one-for-many prewarming, model placement, KV cache reservation, and efficient tensor switching.

Lambdas- cale: Enabling fast scaling for serverless large language model inference,

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer