hub Canonical reference

SGLang: Efficient Execution of Structured Language Model Programs

Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, Cody Hao Yu · 2023 · cs.AI · arXiv 2312.07104

Canonical reference. 100% of citing Pith papers cite this work as background.

79 Pith papers citing it

Background 100% of classified citations

open full Pith review browse 79 citing papers arXiv PDF

abstract

Large language models (LLMs) are increasingly used for complex tasks that require multiple generation calls, advanced prompting techniques, control flow, and structured inputs/outputs. However, efficient systems are lacking for programming and executing these applications. We introduce SGLang, a system for efficient execution of complex language model programs. SGLang consists of a frontend language and a runtime. The frontend simplifies programming with primitives for generation and parallelism control. The runtime accelerates execution with novel optimizations like RadixAttention for KV cache reuse and compressed finite state machines for faster structured output decoding. Experiments show that SGLang achieves up to 6.4x higher throughput compared to state-of-the-art inference systems on various large language and multi-modal models on tasks including agent control, logical reasoning, few-shot learning benchmarks, JSON decoding, retrieval-augmented generation pipelines, and multi-turn chat. The code is publicly available at https://github.com/sgl-project/sglang

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 7

citation-polarity summary

background 7

representative citing papers

VibeServe: Can AI Agents Build Bespoke LLM Serving Systems?

cs.AI · 2026-05-07 · unverdicted · novelty 8.0

VibeServe demonstrates that AI agents can synthesize bespoke LLM serving systems end-to-end, remaining competitive with vLLM in standard settings while outperforming it in six non-standard scenarios involving unusual models, workloads, or hardware.

Knowledge Packs: Zero-Token Knowledge Delivery via KV Cache Injection

cs.CL · 2026-03-22 · unverdicted · novelty 8.0

Knowledge Packs deliver knowledge via pre-computed KV caches with exact equivalence under causal masking, achieving zero divergences on tested questions and enabling value-based steering without training.

Cornfigurator: Automated Planning for Any-to-Any Multimodal Model Serving

cs.LG · 2025-12-16 · conditional · novelty 8.0

Cornfigurator is the first automated deployment planner for generic any-to-any multimodal models that explores the full range of colocation-to-disaggregation strategies and delivers 1.12x to 6.32x higher goodput than existing systems or expert plans.

ELDR: Expert-Locality-Aware Decode Routing for PD-Disaggregated MoE Serving

cs.DC · 2026-07-01 · unverdicted · novelty 7.0 · 2 refs

ELDR reduces median TPOT by 5.9-13.9% in PD-disaggregated MoE serving via expert signatures from prefill, K-means partitioning, and locality-band routing with KV-co-indexed signature cache.

Execution-State Capsules: Graph-Bound Execution-State Checkpoint and Restore for Low-Latency, Small-Batch, On-Device Physical-AI Serving

cs.LG · 2026-06-18 · unverdicted · novelty 7.0

Execution-state capsules enable graph-bound full-state checkpointing and sub-millisecond restore for LLMs including KV and recurrent states, yielding 3.9x-27x TTFT speedups in on-device physical-AI serving.

Beyond FLOPs: Benchmarking Real Inference Acceleration of LLM Pruning under a GEMM-Centric Taxonomy

cs.LG · 2026-06-08 · conditional · novelty 7.0

A GEMM-centric taxonomy and unified benchmark show static depth pruning as the strongest Pareto-optimal baseline for LLM inference acceleration, with the frontier shifting to dynamic depth then static width pruning as quality loss rises.

Tangram: Unlocking Non-Uniform KV Cache for Efficient Multi-turn LLM Serving

cs.LG · 2026-06-04 · unverdicted · novelty 7.0

Tangram makes non-uniform KV cache compression practical for LLM serving with deterministic budget allocation, head group paging, and ahead-of-time load balancing, achieving up to 2.6x throughput gains.

QCFuse: Query-Aware Cache Fusion via Compressed View for Efficient RAG Serving

cs.AI · 2026-06-04 · unverdicted · novelty 7.0

QCFuse achieves full-prefill quality in RAG with 1.7x average prefill speedup over full prefill and 1.5x over ProphetKV via compressed query-aware cache fusion.

SiDP: Memory-Efficient Data Parallelism for Offline LLM Inference

cs.DC · 2026-05-27 · unverdicted · novelty 7.0

SiDP distributes model weights across a DP group with WaS and CaS modes to increase KV cache capacity by up to 1.8x and end-to-end throughput by up to 1.5x over vLLM on H20/H200/B200 GPUs for offline LLM inference.

Asymmetric Virtual Memory Paging for Hybrid Mamba-Transformer Inference

cs.LG · 2026-05-21 · unverdicted · novelty 7.0

AVMP separates KV and SSM cache pools behind unified virtual addressing with failure-triggered migration, cutting OOM events 7.6% and raising throughput 1.83-13.3x on synthetic loads and 2.36x on ShareGPT traces.

NCCLZ: Compression-Enabled GPU Collectives with Decoupled Quantization and Entropy Coding

cs.DC · 2026-05-12 · unverdicted · novelty 7.0

NCCLZ decouples quantization and entropy coding across NCCL stack layers to enable overlapped compression, delivering up to 9.65x speedup over plain NCCL on scientific and training workloads.

Surviving Partial Rank Failures in Wide Expert-Parallel MoE Inference

cs.DC · 2026-05-11 · unverdicted · novelty 7.0

EEP makes wide expert-parallel MoE serving survive single-rank failures with an 11s recovery pause, 8s reintegration pause, and throughput restored to 95% of pre-fault level within 52s while staying within 4.4% of a fixed-membership baseline in steady state.

When Is the Same Model Not the Same Service? A Measurement Study of Hosted Open-Weight LLM APIs

cs.PF · 2026-05-04 · unverdicted · novelty 7.0 · 2 refs

Hosted open-weight LLM APIs function as time-varying heterogeneous services rather than fixed model artifacts, with demand concentrated, supply-use mismatches, and task-specific routing yielding major cost and throughput gains.

Sparse Prefix Caching for Hybrid and Recurrent LLM Serving

cs.LG · 2026-04-17 · unverdicted · novelty 7.0

Sparse prefix caching via dynamic programming for optimal checkpoint placement under overlap distributions improves the Pareto frontier for recurrent and hybrid LLM serving on shared-prefix data.

TrigReason: Trigger-Based Collaboration between Small and Large Reasoning Models

cs.AI · 2026-04-16 · unverdicted · novelty 7.0

TrigReason matches large reasoning model accuracy on math and science benchmarks by delegating most steps to small models and intervening selectively on three triggers, cutting latency by 43.9% and cost by 73.3%.

Fleet: Hierarchical Task-based Abstraction for Megakernels on Multi-Die GPUs

cs.AR · 2026-04-15 · unverdicted · novelty 7.0

Fleet adds a Chiplet-task level to GPU task models, enabling per-chiplet scheduling and cooperative cache reuse in persistent megakernels, yielding 1.3-1.5x lower LLM decode latency and up to 37% less HBM traffic on AMD MI350 hardware.

CodeComp: Structural KV Cache Compression for Agentic Coding

cs.CL · 2026-04-11 · unverdicted · novelty 7.0

CodeComp uses Joern-extracted Code Property Graph priors for training-free structural KV cache compression, outperforming attention-only baselines on bug localization and code generation while matching full-context patch quality.

SnapStream: Efficient Long Sequence Decoding on Dataflow Accelerators

cs.AI · 2025-11-05 · unverdicted · novelty 7.0

SnapStream deploys sparse KV attention in a production inference system on dataflow accelerators, delivering 4x on-chip memory savings for DeepSeek-671B at 128k context with up to 1832 tokens/sec and minimal accuracy loss on LongBench-v2, AIME24, and LiveCodeBench.

MIST: A Co-Design Framework for Heterogeneous, Multi-Stage LLM Inference

cs.AR · 2025-04-14 · unverdicted · novelty 7.0

MIST is a new simulator for heterogeneous multi-stage LLM inference that combines hardware traces with analytical models to explore configuration trade-offs in hybrid CPU-accelerator systems.

OmniPilot: An Uncertainty-Aware LLM Inference Advisor for Heterogeneous GPU Clusters

cs.DC · 2026-07-02 · unverdicted · novelty 6.0

OmniPilot combines conformal quantile regression with OOD detection to rank LLM serving configurations on mixed GPUs, reporting 6.2% MAPE throughput prediction and 95% top-1 accuracy on 460 benchmark runs while abstaining on unsupported cases.

Speculative Pre-Positioning: Decoding Stateful Sessions to the Next Decision Point Off the Critical Path

cs.LG · 2026-06-28 · unverdicted · novelty 6.0

Speculative pre-positioning decodes stateful sessions ahead with the target model to enable near-constant-time responses from cached distributions or pre-paid deltas at 87% precision for capable models.

KernelSight-LM: A Kernel-Level LLM Inference Simulator

cs.PF · 2026-06-26 · unverdicted · novelty 6.0 · 2 refs

KernelSight-LM simulates LLM inference at kernel granularity with cross-generation (12.1% per-kernel error) and target-measured (3.8% error) tiers, yielding end-to-end median errors of 15.4%/12.8%/3.0% and 14.3%/6.2%/2.7% for TTFT/TPOT/throughput across six model families.

FlexMoE: One-for-All Nested Intra-Expert Pruning for MoE Language Models

cs.LG · 2026-06-26 · unverdicted · novelty 6.0

FlexMoE produces nested pruned subnetworks for MoE LLMs across budgets via channel importance ranking and discrete action learning, plus one mid-budget recovery fine-tune, retaining 99.8% performance at 50% expert parameter pruning.

Answer Engineering: Local Trajectory Editing for Protocol-Constrained Decision Making in Large Language Models

cs.AI · 2026-06-19 · unverdicted · novelty 6.0

Answer Engineering uses local trajectory editing during autoregressive generation to raise protocol compliance on a clinical SSNHL benchmark from 25.1% to 83.5% and balanced accuracy from 42.0% to 80.7%.

citing papers explorer

Showing 19 of 19 citing papers after filters.

Execution-State Capsules: Graph-Bound Execution-State Checkpoint and Restore for Low-Latency, Small-Batch, On-Device Physical-AI Serving cs.LG · 2026-06-18 · unverdicted · none · ref 15 · internal anchor
Execution-state capsules enable graph-bound full-state checkpointing and sub-millisecond restore for LLMs including KV and recurrent states, yielding 3.9x-27x TTFT speedups in on-device physical-AI serving.
Beyond FLOPs: Benchmarking Real Inference Acceleration of LLM Pruning under a GEMM-Centric Taxonomy cs.LG · 2026-06-08 · conditional · none · ref 37 · internal anchor
A GEMM-centric taxonomy and unified benchmark show static depth pruning as the strongest Pareto-optimal baseline for LLM inference acceleration, with the frontier shifting to dynamic depth then static width pruning as quality loss rises.
Tangram: Unlocking Non-Uniform KV Cache for Efficient Multi-turn LLM Serving cs.LG · 2026-06-04 · unverdicted · none · ref 47 · internal anchor
Tangram makes non-uniform KV cache compression practical for LLM serving with deterministic budget allocation, head group paging, and ahead-of-time load balancing, achieving up to 2.6x throughput gains.
Asymmetric Virtual Memory Paging for Hybrid Mamba-Transformer Inference cs.LG · 2026-05-21 · unverdicted · none · ref 21 · internal anchor
AVMP separates KV and SSM cache pools behind unified virtual addressing with failure-triggered migration, cutting OOM events 7.6% and raising throughput 1.83-13.3x on synthetic loads and 2.36x on ShareGPT traces.
Sparse Prefix Caching for Hybrid and Recurrent LLM Serving cs.LG · 2026-04-17 · unverdicted · none · ref 38 · internal anchor
Sparse prefix caching via dynamic programming for optimal checkpoint placement under overlap distributions improves the Pareto frontier for recurrent and hybrid LLM serving on shared-prefix data.
Speculative Pre-Positioning: Decoding Stateful Sessions to the Next Decision Point Off the Critical Path cs.LG · 2026-06-28 · unverdicted · none · ref 2 · internal anchor
Speculative pre-positioning decodes stateful sessions ahead with the target model to enable near-constant-time responses from cached distributions or pre-paid deltas at 87% precision for capable models.
FlexMoE: One-for-All Nested Intra-Expert Pruning for MoE Language Models cs.LG · 2026-06-26 · unverdicted · none · ref 19 · internal anchor
FlexMoE produces nested pruned subnetworks for MoE LLMs across budgets via channel importance ranking and discrete action learning, plus one mid-budget recovery fine-tune, retaining 99.8% performance at 50% expert parameter pruning.
RKSC: Reasoning-Aware KV Cache Sharing and Confident Early Exit for Multi-Step LLM Inference cs.LG · 2026-06-07 · unverdicted · none · ref 17 · internal anchor
RKSC delivers 3.008x mean speedup over baseline and 1.66x over vLLM prefix caching for multi-branch LLM reasoning via similarity-based KV sharing and confidence-gated early exit, with 0.37% error rate.
Sparrow: Sparse Rollout for Stable and Efficient Long-context RL of Large Language Models cs.LG · 2026-06-07 · unverdicted · none · ref 20 · internal anchor
Sparrow uses a dynamic sparsity schedule keyed to the lower tail of sparse-to-dense actor-policy mismatch to enable stable and faster rollouts in long-context RL for LLMs.
A Paired Testing Protocol for Batch-Conditioned Refusal Robustness in LLM Serving cs.LG · 2026-05-26 · unverdicted · none · ref 27 · internal anchor
The paper introduces a paired testing protocol for batch-conditioned refusal robustness in LLM serving and reports low rates of genuine safety-label flips after adjudication, with a batch-invariant kernel ablation eliminating observed flips.
The Constraint Tax: Measuring Validity-Correctness Tradeoffs in Structured Outputs for Small Language Models cs.LG · 2026-05-20 · unverdicted · none · ref 8 · internal anchor
Enforcing hard schemas on sub-3B models raises schema validity to 100% but drops answer accuracy from 19.7% to 11.0% and executable accuracy from 91.5% to 48.0% on tool-call tasks.
OpenJarvis: Personal AI, On Personal Devices cs.LG · 2026-05-16 · unverdicted · none · ref 97 · internal anchor
OpenJarvis decomposes personal AI into Intelligence, Engine, Agents, Tools & Memory, and Learning primitives and applies LLM-guided spec search to produce on-device configurations that reach within 3.2 pp of cloud baselines on average across eight tasks.
DualKV: Shared-Prompt Flash Attention for Efficient RL Training with Large Rollouts and Long Contexts cs.LG · 2026-05-14 · unverdicted · none · ref 25 · 2 links · internal anchor
DualKV eliminates redundant prompt replication in RL training attention kernels via fused dual-KV CUDA operations and token repacking, delivering 1.63-3.82x policy-update speedups while remaining mathematically equivalent to standard attention.
Attention Once Is All You Need: Efficient Streaming Inference with Stateful Transformers cs.LG · 2026-05-13 · unverdicted · none · ref 15 · internal anchor
Stateful sessions with incremental KV cache and flash queries allow O(|q|) latency in streaming transformer inference, delivering up to 5.9x speedup over conventional engines while preserving full attention.
VisMMOE: Exploiting Visual-Expert Affinity for Efficient Visual-Language MoE Offloading cs.LG · 2026-05-07 · unverdicted · none · ref 59 · internal anchor
VisMMoE exploits visual-expert affinity via token pruning to achieve up to 2.68x faster VL-MoE inference on memory-constrained hardware while keeping accuracy competitive.
RaMP: Runtime-Aware Megakernel Polymorphism for Mixture-of-Experts cs.LG · 2026-04-28 · unverdicted · none · ref 14 · internal anchor
RaMP uses a hardware-derived performance region analysis and a four-parameter wave cost model to select optimal polymorphic kernel configurations for MoE inference from runtime expert histograms, delivering 1.22x kernel and 1.30x end-to-end speedups with 0.93% mean regret after brief profiling.
Stateful Inference for Low-Latency Multi-Agent Tool Calling cs.LG · 2026-05-25 · unverdicted · none · ref 2 · internal anchor
Stateful KV cache with radix prefix cache and prompt-lookup speculative decoder reduces per-turn cost from O(n) to O(Δ) and delivers 2.1-4.2× speedups versus vLLM and SGLang on generated multi-agent workloads.
How to Compress KV Cache in RL Post-Training? Shadow Mask Distillation for Memory-Efficient Alignment cs.LG · 2026-05-07 · unverdicted · none · ref 13 · internal anchor
Shadow Mask Distillation enables KV cache compression in RL post-training of LLMs by mitigating amplified off-policy bias that defeats standard importance reweighting.
UltraQuant: 4-bit KV Caching for Context-Heavy Agents cs.LG · 2026-06-18 · unverdicted · none · ref 30 · internal anchor
UltraQuant applies 4-bit KV caching with TurboQuant-style rotation and custom AMD kernels to context-heavy agent workloads, delivering 3.47x P50 TTFT reduction in cache-pressured rounds and 1.63x throughput gain over FP8.

SGLang: Efficient Execution of Structured Language Model Programs

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer