SGLang: Efficient Execution of Structured Language Model Programs
Canonical reference. 100% of citing Pith papers cite this work as background.
abstract
Large language models (LLMs) are increasingly used for complex tasks that require multiple generation calls, advanced prompting techniques, control flow, and structured inputs/outputs. However, efficient systems are lacking for programming and executing these applications. We introduce SGLang, a system for efficient execution of complex language model programs. SGLang consists of a frontend language and a runtime. The frontend simplifies programming with primitives for generation and parallelism control. The runtime accelerates execution with novel optimizations like RadixAttention for KV cache reuse and compressed finite state machines for faster structured output decoding. Experiments show that SGLang achieves up to 6.4x higher throughput compared to state-of-the-art inference systems on various large language and multi-modal models on tasks including agent control, logical reasoning, few-shot learning benchmarks, JSON decoding, retrieval-augmented generation pipelines, and multi-turn chat. The code is publicly available at https://github.com/sgl-project/sglang
citation-role summary: background 7
citation-polarity summary: background 7
citing papers explorer
-
VibeServe: Can AI Agents Build Bespoke LLM Serving Systems?
VibeServe demonstrates that AI agents can synthesize bespoke LLM serving systems end-to-end, remaining competitive with vLLM in standard settings while outperforming it in six non-standard scenarios involving unusual models, workloads, or hardware.
-
Knowledge Packs: Zero-Token Knowledge Delivery via KV Cache Injection
Knowledge Packs deliver knowledge via pre-computed KV caches with exact equivalence under causal masking, achieving zero divergences on tested questions and enabling value-based steering without training.
-
Cornfigurator: Automated Planning for Any-to-Any Multimodal Model Serving
Cornfigurator is the first automated deployment planner for generic any-to-any multimodal models that explores the full range of colocation-to-disaggregation strategies and delivers 1.12x to 6.32x higher goodput than existing systems or expert plans.
-
Beyond FLOPs: Benchmarking Real Inference Acceleration of LLM Pruning under a GEMM-Centric Taxonomy
A GEMM-centric taxonomy and unified benchmark show static depth pruning as the strongest Pareto-optimal baseline for LLM inference acceleration, with the frontier shifting to dynamic depth then static width pruning as quality loss rises.
-
QCFuse: Query-Aware Cache Fusion via Compressed View for Efficient RAG Serving
QCFuse achieves full-prefill quality in RAG with 1.7x average prefill speedup over full prefill and 1.5x over ProphetKV via compressed query-aware cache fusion.
-
SiDP: Memory-Efficient Data Parallelism for Offline LLM Inference
SiDP distributes model weights across a DP group with WaS and CaS modes to increase KV cache capacity by up to 1.8x and end-to-end throughput by up to 1.5x over vLLM on H20/H200/B200 GPUs for offline LLM inference.
-
Asymmetric Virtual Memory Paging for Hybrid Mamba-Transformer Inference
AVMP separates KV and SSM cache pools behind unified virtual addressing with failure-triggered migration, cutting OOM events 7.6% and raising throughput 1.83-13.3x on synthetic loads and 2.36x on ShareGPT traces.
-
NCCLZ: Compression-Enabled GPU Collectives with Decoupled Quantization and Entropy Coding
NCCLZ decouples quantization and entropy coding across NCCL stack layers to enable overlapped compression, delivering up to 9.65x speedup over plain NCCL on scientific and training workloads.
-
Surviving Partial Rank Failures in Wide Expert-Parallel MoE Inference
EEP makes wide expert-parallel MoE serving survive single-rank failures with an 11s recovery pause, 8s reintegration pause, and throughput restored to 95% of pre-fault level within 52s while staying within 4.4% of a fixed-membership baseline in steady state.
-
When Is the Same Model Not the Same Service? A Measurement Study of Hosted Open-Weight LLM APIs
Hosted open-weight LLM APIs function as time-varying heterogeneous services rather than fixed model artifacts, with demand concentrated, supply-use mismatches, and task-specific routing yielding major cost and throughput gains.
-
Sparse Prefix Caching for Hybrid and Recurrent LLM Serving
Sparse prefix caching via dynamic programming for optimal checkpoint placement under overlap distributions improves the Pareto frontier for recurrent and hybrid LLM serving on shared-prefix data.
-
TrigReason: Trigger-Based Collaboration between Small and Large Reasoning Models
TrigReason matches large reasoning model accuracy on math and science benchmarks by delegating most steps to small models and intervening selectively on three triggers, cutting latency by 43.9% and cost by 73.3%.
-
Fleet: Hierarchical Task-based Abstraction for Megakernels on Multi-Die GPUs
Fleet adds a Chiplet-task level to GPU task models, enabling per-chiplet scheduling and cooperative cache reuse in persistent megakernels, yielding 1.3-1.5x lower LLM decode latency and up to 37% less HBM traffic on AMD MI350 hardware.
-
CodeComp: Structural KV Cache Compression for Agentic Coding
CodeComp uses Joern-extracted Code Property Graph priors for training-free structural KV cache compression, outperforming attention-only baselines on bug localization and code generation while matching full-context patch quality.
-
SnapStream: Efficient Long Sequence Decoding on Dataflow Accelerators
SnapStream deploys sparse KV attention in a production inference system on dataflow accelerators, delivering 4x on-chip memory savings for DeepSeek-671B at 128k context with up to 1832 tokens/sec and minimal accuracy loss on LongBench-v2, AIME24, and LiveCodeBench.
-
MIST: A Co-Design Framework for Heterogeneous, Multi-Stage LLM Inference
MIST is a new simulator for heterogeneous multi-stage LLM inference that combines hardware traces with analytical models to explore configuration trade-offs in hybrid CPU-accelerator systems.
-
Speculative Pre-Positioning: Decoding Stateful Sessions to the Next Decision Point Off the Critical Path
Speculative pre-positioning decodes stateful sessions ahead with the target model to enable near-constant-time responses from cached distributions or pre-paid deltas at 87% precision for capable models.
-
KernelSight-LM: A Kernel-Level LLM Inference Simulator
KernelSight-LM simulates token-level LLM inference to predict per-kernel latencies and end-to-end metrics (TTFT, TPOT, throughput) with 12.1% and 3.8% kernel errors in cross-generation and target-measured tiers.
-
FlexMoE: One-for-All Nested Intra-Expert Pruning for MoE Language Models
FlexMoE produces nested pruned subnetworks for MoE LLMs across budgets via channel importance ranking and discrete action learning, plus one mid-budget recovery fine-tune, retaining 99.8% performance at 50% expert parameter pruning.
-
Draft-OPD: On-Policy Distillation for Speculative Draft Models
Draft-OPD applies on-policy distillation via target-assisted generation and error replay to train speculative draft models, yielding over 5x lossless acceleration and gains over EAGLE-3 and DFlash.
-
A Paired Testing Protocol for Batch-Conditioned Refusal Robustness in LLM Serving
The paper introduces a paired testing protocol for batch-conditioned refusal robustness in LLM serving and reports low rates of genuine safety-label flips after adjudication, with a batch-invariant kernel ablation eliminating observed flips.
-
REVERSE: Reinforcing Evidence Verification and Search for Agentic Image geo-localization
REVERSE uses tool-grounded trajectories and process rewards on visual grounding, query utility, and evidence discrimination to train a 4B model that outperforms retrieval-augmented baselines on Im2GPS3k and YFCC4k.
-
Identifying and Mitigating Systemic Measurement Bias in Production LLM Inference Benchmarks
Standard LLM inference benchmarks introduce systematic bias via GIL-induced queuing in single-process asyncio setups; a multi-process framework and NTPOT metric isolate true serving engine performance at high query rates.
-
Format-Constraint Coupling in Knowledge Graph Construction from Statistical Tables
An empirical 2x2 factorial study on six statistical datasets shows that format and schema constraints in LLM-based KG construction from CSV tables produce super-additive fidelity loss of up to +1.180, with mismatched pairs falling below baseline; the authors also release CSVFidelity-Bench.
-
The Constraint Tax: Measuring Validity-Correctness Tradeoffs in Structured Outputs for Small Language Models
Enforcing hard schemas on sub-3B models raises schema validity to 100% but drops answer accuracy from 19.7% to 11.0% and executable accuracy from 91.5% to 48.0% on tool-call tasks.
-
OpenJarvis: Personal AI, On Personal Devices
OpenJarvis decomposes personal AI into Intelligence, Engine, Agents, Tools & Memory, and Learning primitives and applies LLM-guided spec search to produce on-device configurations that reach within 3.2 pp of cloud baselines on average across eight tasks.
-
DualKV: Shared-Prompt Flash Attention for Efficient RL Training with Large Rollouts and Long Contexts
DualKV eliminates redundant prompt replication in RL training attention kernels via fused dual-KV CUDA operations and token repacking, delivering 1.63-3.82x policy-update speedups while remaining mathematically equivalent to standard attention.
-
Attention Once Is All You Need: Efficient Streaming Inference with Stateful Transformers
Stateful sessions with incremental KV cache and flash queries allow O(|q|) latency in streaming transformer inference, delivering up to 5.9x speedup over conventional engines while preserving full attention.
-
Dooly: Configuration-Agnostic, Redundancy-Aware Profiling for LLM Inference Simulation
Dooly reduces LLM inference profiling GPU-hours by 56.4% across 12 models while keeping simulation MAPE under 5% for TTFT and 8% for TPOT by making profiling configuration-agnostic and redundancy-aware.
-
VisMMoE: Exploiting Visual-Expert Affinity for Efficient Visual-Language MoE Offloading
VisMMoE exploits visual-expert affinity via token pruning to achieve up to 2.68x faster VL-MoE inference on memory-constrained hardware while keeping accuracy competitive.
-
RaMP: Runtime-Aware Megakernel Polymorphism for Mixture-of-Experts
RaMP uses a hardware-derived performance region analysis and a four-parameter wave cost model to select optimal polymorphic kernel configurations for MoE inference from runtime expert histograms, delivering 1.22x kernel and 1.30x end-to-end speedups with 0.93% mean regret after brief profiling.
-
When Agents Go Quiet: Output Generation Capacity and Format-Cost Separation for LLM Document Synthesis
LLM agents avoid output stalling and reduce generation tokens by 48-72% via deferred template rendering guided by Output Generation Capacity and a Format-Cost Separation Theorem.
-
ProbeLogits: Kernel-Level LLM Inference Primitives for AI-Native Operating Systems
ProbeLogits performs single-pass logit reading inside the kernel to classify LLM agent actions as safe or dangerous, reaching 97-99% block rates on HarmBench and F1 parity or better than Llama Guard 3 at 2.5x lower latency.
-
Relax: An Asynchronous Reinforcement Learning Engine for Omni-Modal Post-Training at Scale
Relax is a new RL training engine with omni-native design and async execution that delivers up to 2x speedups over baselines like veRL while converging to equivalent reward levels on Qwen3 models.
-
MEMENTO: Teaching LLMs to Manage Their Own Context
MEMENTO trains LLMs to segment reasoning into blocks, generate mementos as dense summaries, and reason forward using only mementos and KV states, cutting peak KV cache by ~2.5x while preserving benchmark accuracy.
-
Kernel-Smith: A Unified Recipe for Evolutionary Kernel Optimization
Kernel-Smith combines evolutionary search with RL post-training to generate optimized GPU kernels, achieving SOTA speedups on KernelBench that beat Gemini-3.0-pro and Claude-4.6-opus on NVIDIA Triton and generalize to MetaX MACA.
-
Sutradhara: An Intelligent Orchestrator-Engine Co-design for Tool-based Agentic Inference
Sutradhara co-designs orchestrator and LLM serving to overlap tool execution with prefill, stream tool dispatch during decode, and use semantic hints for cache management, yielding up to 77% higher load at fixed median FTR latency or 15% lower median FTR at fixed load.
-
CacheClip: Accelerating RAG with Effective KV Cache Reuse
CacheClip accelerates RAG prefill by up to 3.33x via auxiliary-model-guided selective KV recomputation while retaining 85-91% of full-attention quality on NIAH and LongBench.
-
FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving
FlashInfer delivers a customizable attention kernel that reduces inter-token latency by 29-69% in LLM serving benchmarks via optimized KV-cache storage and load-balanced scheduling compatible with CUDA graphs.
-
Large Language Monkeys: Scaling Inference Compute with Repeated Sampling
Repeated sampling scales problem coverage log-linearly with sample count, improving SWE-bench Lite performance from 15.9% to 56% using 250 samples.
-
Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference
Ada-KV is the first head-wise adaptive KV cache budget allocator for LLMs, using a theoretical loss upper bound to allocate eviction differently per attention head and yielding higher quality than uniform methods on long-context benchmarks.
-
Coverage-Driven KV Cache Eviction for Efficient and Improved Inference of LLM
K-VEC is a coverage-aware KV-cache eviction strategy using cross-head and cross-layer modules that improves performance by up to 10.35 points over prior methods on LongBench subsets at fixed memory budget.
-
The Serialized Bridge: Understanding and Recovering LLM Serving Performance under Blackwell GPU Confidential Computing
The confidential VM-GPU bridge on Blackwell GPUs serializes host-device transfers and raises setup costs, causing a 13-27% loss in LLM serving throughput and doubling KV-cache restore latency.
-
Stateful Inference for Low-Latency Multi-Agent Tool Calling
A stateful KV cache with a radix prefix cache and a prompt-lookup speculative decoder reduces per-turn cost from O(n) to O(Δ) and delivers 2.1-4.2× speedups versus vLLM and SGLang on generated multi-agent workloads.
-
Polar: Agentic RL on Any Harness at Scale
Polar provides a scalable asynchronous RL rollout framework that works with arbitrary agent harnesses via black-box proxying and trajectory reconstruction, demonstrated by performance gains on SWE-Bench Verified across multiple coding harnesses.
-
Towards On-Policy Data Evolution for Visual-Native Multimodal Deep Search Agents
The paper proposes an image-bank harness and ODE closed-loop data generation to boost multimodal deep search agents, reporting average score gains from 24.9% to 39.0% on 8 benchmarks for the 8B model and from 30.6% to 41.5% for the 30B model.
-
How to Compress KV Cache in RL Post-Training? Shadow Mask Distillation for Memory-Efficient Alignment
Shadow Mask Distillation enables KV cache compression in RL post-training of LLMs by mitigating amplified off-policy bias that defeats standard importance reweighting.
-
Securing the Agent: Vendor-Neutral, Multitenant Enterprise Retrieval and Tool Use
A server-side architecture with policy-aware ingestion and ABAC-based retrieval gating prevents cross-tenant data leakage in multitenant enterprise RAG and agent systems.
-
VLMaxxing through FrameMogging: Training-Free Anti-Recomputation for Video Vision-Language Models
Training-free adaptive reuse of stable visual state in video VLMs reduces follow-up latency by 15-36x on Qwen2.5-VL while preserving correctness on VideoMME, with smaller first-query speedups via pruning.
-
Human-Less LLM Serving: Quantifying the Human Tax on Throughput
A measurement study finds that LLM serving systems sacrifice 60-93% of their throughput to meet human-centric TTFT/TPOT SLOs that are unnecessary for programmatic long-horizon tasks.