

SGLang: Efficient Execution of Structured Language Model Programs

Canonical reference. 100% of citing Pith papers cite this work as background.

76 Pith papers citing it
Background 100% of classified citations
abstract

Large language models (LLMs) are increasingly used for complex tasks that require multiple generation calls, advanced prompting techniques, control flow, and structured inputs/outputs. However, efficient systems are lacking for programming and executing these applications. We introduce SGLang, a system for efficient execution of complex language model programs. SGLang consists of a frontend language and a runtime. The frontend simplifies programming with primitives for generation and parallelism control. The runtime accelerates execution with novel optimizations like RadixAttention for KV cache reuse and compressed finite state machines for faster structured output decoding. Experiments show that SGLang achieves up to 6.4x higher throughput compared to state-of-the-art inference systems on various large language and multi-modal models on tasks including agent control, logical reasoning, few-shot learning benchmarks, JSON decoding, retrieval-augmented generation pipelines, and multi-turn chat. The code is publicly available at https://github.com/sgl-project/sglang
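The abstract names RadixAttention as the runtime's mechanism for KV cache reuse but does not describe it here. The following is a minimal sketch of the general idea of prefix-based reuse, assuming cached entries are indexed by shared token prefixes. `PrefixCache`, `match_prefix`, and `serve` are hypothetical names for illustration, not SGLang's API. The sketch uses a plain per-token trie rather than a compressed radix tree, and it tracks only token counts, not KV tensors or eviction.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class TrieNode:
    # Children keyed by token id. A radix tree would merge chains of
    # single-child nodes into one edge; this toy version does not.
    children: Dict[int, "TrieNode"] = field(default_factory=dict)


class PrefixCache:
    """Toy token-prefix index (hypothetical, for illustration only)."""

    def __init__(self) -> None:
        self.root = TrieNode()

    def match_prefix(self, tokens: List[int]) -> int:
        """Return how many leading tokens are already present in the index."""
        node, matched = self.root, 0
        for tok in tokens:
            child = node.children.get(tok)
            if child is None:
                break
            node, matched = child, matched + 1
        return matched

    def insert(self, tokens: List[int]) -> None:
        """Record a full token sequence so later requests can reuse it."""
        node = self.root
        for tok in tokens:
            node = node.children.setdefault(tok, TrieNode())

    def serve(self, tokens: List[int]) -> Tuple[int, int]:
        """Return (reused, recomputed) token counts for one request."""
        reused = self.match_prefix(tokens)
        self.insert(tokens)
        return reused, len(tokens) - reused


cache = PrefixCache()
system_prompt = [1, 2, 3, 4]  # assumed shared prefix across requests
first = cache.serve(system_prompt + [10, 11])
second = cache.serve(system_prompt + [20])
print(first, second)
```

In this sketch the second request reuses the four shared prefix tokens and only needs to process its one new token, which is the kind of saving that shared system prompts, few-shot examples, and multi-turn chat histories make possible.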


citation-role summary

background 7


representative citing papers

VibeServe: Can AI Agents Build Bespoke LLM Serving Systems?

cs.AI · 2026-05-07 · unverdicted · novelty 8.0

VibeServe demonstrates that AI agents can synthesize bespoke LLM serving systems end-to-end, remaining competitive with vLLM in standard settings while outperforming it in six non-standard scenarios involving unusual models, workloads, or hardware.

Cornfigurator: Automated Planning for Any-to-Any Multimodal Model Serving

cs.LG · 2025-12-16 · conditional · novelty 8.0

Cornfigurator is the first automated deployment planner for generic any-to-any multimodal models that explores the full range of colocation-to-disaggregation strategies and delivers 1.12x to 6.32x higher goodput than existing systems or expert plans.

SiDP: Memory-Efficient Data Parallelism for Offline LLM Inference

cs.DC · 2026-05-27 · unverdicted · novelty 7.0

SiDP distributes model weights across a DP group with WaS and CaS modes to increase KV cache capacity by up to 1.8x and end-to-end throughput by up to 1.5x over vLLM on H20/H200/B200 GPUs for offline LLM inference.

Surviving Partial Rank Failures in Wide Expert-Parallel MoE Inference

cs.DC · 2026-05-11 · unverdicted · novelty 7.0

EEP makes wide expert-parallel MoE serving survive single-rank failures with an 11s recovery pause, 8s reintegration pause, and throughput restored to 95% of pre-fault level within 52s while staying within 4.4% of a fixed-membership baseline in steady state.

Sparse Prefix Caching for Hybrid and Recurrent LLM Serving

cs.LG · 2026-04-17 · unverdicted · novelty 7.0

Sparse prefix caching via dynamic programming for optimal checkpoint placement under overlap distributions improves the Pareto frontier for recurrent and hybrid LLM serving on shared-prefix data.

CodeComp: Structural KV Cache Compression for Agentic Coding

cs.CL · 2026-04-11 · unverdicted · novelty 7.0

CodeComp uses Joern-extracted Code Property Graph priors for training-free structural KV cache compression, outperforming attention-only baselines on bug localization and code generation while matching full-context patch quality.

SnapStream: Efficient Long Sequence Decoding on Dataflow Accelerators

cs.AI · 2025-11-05 · unverdicted · novelty 7.0

SnapStream deploys sparse KV attention in a production inference system on dataflow accelerators, delivering 4x on-chip memory savings for DeepSeek-671B at 128k context with up to 1832 tokens/sec and minimal accuracy loss on LongBench-v2, AIME24, and LiveCodeBench.

Harnessing Streaming Video in the Wild

cs.CV · 2026-06-07 · unverdicted · novelty 6.0

Presents Streaming-Train-248K dataset, Streaming Harness system, and Streaming-Eval benchmark to enable VLMs for proactive, memory-equipped streaming video understanding.

citing papers explorer

Showing 11 of 11 citing papers after filters.

  • Knowledge Packs: Zero-Token Knowledge Delivery via KV Cache Injection cs.CL · 2026-03-22 · unverdicted · none · ref 7 · internal anchor

    Knowledge Packs deliver knowledge via pre-computed KV caches with exact equivalence under causal masking, achieving zero divergences on tested questions and enabling value-based steering without training.

  • CodeComp: Structural KV Cache Compression for Agentic Coding cs.CL · 2026-04-11 · unverdicted · none · ref 17 · internal anchor

    CodeComp uses Joern-extracted Code Property Graph priors for training-free structural KV cache compression, outperforming attention-only baselines on bug localization and code generation while matching full-context patch quality.

  • Draft-OPD: On-Policy Distillation for Speculative Draft Models cs.CL · 2026-05-28 · unverdicted · none · ref 30 · internal anchor

    Draft-OPD applies on-policy distillation via target-assisted generation and error replay to train speculative draft models, yielding over 5x lossless acceleration and gains over EAGLE-3 and DFlash.

  • Relax: An Asynchronous Reinforcement Learning Engine for Omni-Modal Post-Training at Scale cs.CL · 2026-04-13 · unverdicted · none · ref 18 · internal anchor

    Relax is a new RL training engine with omni-native design and async execution that delivers up to 2x speedups over baselines like veRL while converging to equivalent reward levels on Qwen3 models.

  • Kernel-Smith: A Unified Recipe for Evolutionary Kernel Optimization cs.CL · 2026-03-30 · unverdicted · none · ref 34 · internal anchor

    Kernel-Smith combines evolutionary search with RL post-training to generate optimized GPU kernels, achieving SOTA speedups on KernelBench that beat Gemini-3.0-pro and Claude-4.6-opus on NVIDIA Triton and generalize to MetaX MACA.

  • Coverage-Driven KV Cache Eviction for Efficient and Improved Inference of LLM cs.CL · 2026-06-28 · unverdicted · none · ref 37 · internal anchor

    K-VEC is a coverage-aware KV-cache eviction strategy using cross-head and cross-layer modules that improves performance by up to 10.35 points over prior methods on LongBench subsets at fixed memory budget.

  • Towards On-Policy Data Evolution for Visual-Native Multimodal Deep Search Agents cs.CL · 2026-05-11 · unverdicted · none · ref 23 · 2 links · internal anchor

    Proposes image-bank harness and ODE closed-loop data generation to boost multimodal deep search agents, reporting average score gains from 24.9% to 39.0% on 8 benchmarks for 8B model and 30.6% to 41.5% for 30B.

  • LLM StructCore: Schema-Guided Reasoning Condensation and Deterministic Compilation cs.CL · 2026-04-22 · unverdicted · none · ref 15 · internal anchor

    Two-stage Schema-Guided Reasoning with LLM condensation and deterministic compilation achieves macro-F1 of 0.63 on dyspnea CRF filling task and is language-agnostic.

  • Matrix: Peer-to-Peer Multi-Agent Synthetic Data Generation Framework cs.CL · 2025-11-26 · unverdicted · none · ref 16 · internal anchor

    Matrix provides a peer-to-peer multi-agent system for synthetic data generation that scales to tens of thousands of workflows and delivers 2-15x higher throughput than centralized designs without quality loss.

  • DiffAdapt: Difficulty-Adaptive Reasoning for Token-Efficient LLM Inference cs.CL · 2025-10-22 · unverdicted · none · ref 20 · internal anchor

    DiffAdapt detects problem difficulty via entropy in reasoning traces and applies one of three fixed inference strategies per question, cutting token usage up to 22.4% with comparable or better accuracy across five models and eight benchmarks.

  • A Survey on LLM-as-a-Judge cs.CL · 2024-11-23 · unverdicted · none · ref 222 · internal anchor

    A survey on LLM-as-a-Judge that reviews reliability strategies, proposes evaluation methods, and introduces a novel benchmark for assessing such systems.