LoongServe: Efficiently Serving Long-context Large Language Models with Elastic Sequence Parallelism

2024 · arXiv 2404.09526

5 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

Beyond Prediction: Tail-Aware Scheduling for LLM Inference

cs.LG · 2026-06-16 · unverdicted · novelty 7.0

Presents a distribution-aware scheduling framework for LLM inference that relies on statistical signals instead of predictions, reducing P99 TTLT by 35-50% and TTFT by 34-47% versus SRPT with perfect length knowledge.

AlignedServe: Orchestrating Prefix-aware Batching to Build a High-throughput and Computing-efficient LLM Serving System

cs.DC · 2026-05-22 · unverdicted · novelty 5.0

AlignedServe uses prefix-aware batching, large CPU in-flight request pools, batch scheduling, and GPU-to-GPU KV prefetching to raise decoding throughput up to 1.98x and cut latency up to 7.4x versus prior serving systems.

ODMA: On-Demand Memory Allocation Strategy for LLM Serving on LPDDR-Class Accelerators

cs.AR · 2025-12-10 · unverdicted · novelty 5.0

ODMA raises KV-cache utilization by up to 19.25% and throughput by 23-27% on Cambricon MLU accelerators by dynamically adjusting prediction buckets and using a safety pool for LLM serving.

ServeGen: Workload Characterization and Generation of Large Language Model Serving in Production

cs.DC · 2025-05-15 · unverdicted · novelty 5.0

ServeGen characterizes production LLM inference workloads across model types and generates realistic per-client composed workloads that reduce under-provisioning by 50% in a production validation.

A Survey on Efficient Inference for Large Language Models

cs.CL · 2024-04-22 · accept · novelty 3.0

The paper surveys techniques to speed up and reduce the resource needs of LLM inference, organized by data-level, model-level, and system-level changes, with comparative experiments on representative methods.

citing papers explorer

Showing 4 of 4 citing papers after filters.

Beyond Prediction: Tail-Aware Scheduling for LLM Inference
cs.LG · 2026-06-16 · unverdicted · none · ref 36

AlignedServe: Orchestrating Prefix-aware Batching to Build a High-throughput and Computing-efficient LLM Serving System
cs.DC · 2026-05-22 · unverdicted · none · ref 38

ODMA: On-Demand Memory Allocation Strategy for LLM Serving on LPDDR-Class Accelerators
cs.AR · 2025-12-10 · unverdicted · none · ref 37

ServeGen: Workload Characterization and Generation of Large Language Model Serving in Production
cs.DC · 2025-05-15 · unverdicted · none · ref 47
